December 2021

This project was built during my studies in the MSc Statistics for the module "Managing and Visualizing Data".

It was a group project among 3 people and the contributions of each are shown at the end.

US CAR ACCIDENTS

In [18]:
from IPython.display import Image
Image(filename='car-image.jpg', width =600, height=300)
Out[18]:

Project objective

The objective of this project is to explore and analyse the accidents that happened in the United States in February and March of 2016. The purpose is to gain insights into the circumstances under which an accident is more prone to happen, its severity and some of the consequences that accidents have on US roads. This analysis is intended to be used by the authorities to prevent accidents, by understanding where their budget and efforts should focus in terms of road infrastructure, and which decisions should be taken on the day based on weather forecasts.

Furthermore, modelling techniques are used to predict the severity of the accident. A game-theoretic approach is used to explain the output of one of the models, and the most important features are revealed. Linear regression is used to test whether there is a correlation between the population of a state and the number of accidents.

The main features analysed in relation to the accidents and their severities are road features at the point of the accident, day/time, weather conditions, city and state characteristics, and the traffic on the road of interest.

Background of the datasets

The datasets used in this project come from the following sources:

  1. US Accidents: the core dataset, consisting of accidents from 2016-2020, which we narrowed down to the February and March 2016 data. It also includes weather conditions and road features. [1, 2, 3]
  2. Traffic data retrieved from API calls on Google Maps. The data describes the time needed for a car to go from the Start point of the accident to the End point of the accident, both at the time of the accident and 30 minutes before.
    (More information about the API used can be found here: https://app.outscraper.com/api-docs)
  3. Web-scraped city population data from Wikipedia: consists of US population data by state, based upon the censuses undertaken in 2010 and 2020.
  4. Web-scraped vehicle ownership data from Wikipedia: consists of vehicles per 1000 people within each state.
  5. Web-scraped state names to link with the state abbreviations.
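As a sketch of how the per-accident traffic feature (item 2 above) could be assembled: the parameter names, endpoint and response shape below are hypothetical illustrations, not the actual Outscraper API (see the API docs linked above for the real interface).

```python
import json

def build_query(start_lat, start_lng, end_lat, end_lng, departure_time):
    """Assemble (hypothetical) query parameters for one travel-time request."""
    return {
        'origin': f'{start_lat},{start_lng}',
        'destination': f'{end_lat},{end_lng}',
        'departure_time': departure_time,  # accident time, or 30 minutes before it
    }

def traffic_delay_ratio(seconds_at_accident, seconds_30min_before):
    """Travel time at the accident time relative to 30 minutes earlier;
    values above 1 suggest traffic slowed down around the accident."""
    return seconds_at_accident / seconds_30min_before

# Made-up example payload to show how the feature would be derived:
sample = json.loads('{"duration_in_traffic": {"now": 420, "before": 300}}')
print(round(traffic_delay_ratio(sample['duration_in_traffic']['now'],
                                sample['duration_in_traffic']['before']), 2))  # 1.4
```

Two requests per accident (one per departure time) would then yield the travel-time pair described above.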

To address the stated objectives, our analysis covers characteristics that can also be grouped into the following categories:

  • features known before an accident happens, which are also used to predict the severity of an accident
  • features that occur at the time of the accident, which shed light on the way an accident is handled by the authorities or people, and give indications of things that can be taken into consideration
  • features that occur after the accident, such as the description of an accident and the severity assigned to it

Table of Contents

1. Setup

2. Data Preparation
2.1 Read US Accidents data
2.2 Initial Data Exploration
2.3 Data Cleansing

3. Data Exploration
3.1 Weather and population data
3.2 Datetime data
3.3 Location data
3.4 Vehicle Ownership data
3.5 Traffic Google Maps data
3.6 Accident Description data

4. Dimensionality Reduction
4.1 City and State variables
4.2 Weather Condition variable

5. Model Evaluation
5.1 Multi-class classification
5.2 Binary classification
5.3 Feature Importance

6. Main findings

7. References

8. Individual Contribution

1. Setup

Import the necessary packages for the project:

In [19]:
import pandas as pd
#from outscraper import ApiClient
from datetime import datetime
import numpy as np
import json
import math
import missingno as msno
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import calendar
from sklearn import (manifold, decomposition)

# libraries for model selection and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc, confusion_matrix, precision_recall_curve, average_precision_score, accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
import shap

# libraries for model fitting
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost

import tabulate

#for visualising interactive maps
import os
import folium
from folium import plugins
import rioxarray as rxr
import earthpy as et
import earthpy.spatial as es
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame
import plotly.graph_objects as go
import plotly.express as px

#for text analysis
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from matplotlib.colors import LinearSegmentedColormap
import colorama
from colorama import Fore

import yaml #for printing a dictionary in a clean way
from IPython.core.display import HTML as Center #to align all the plots in this notebook in the center
from adjustText import adjust_text
from xgboost import XGBClassifier

Suppress code warnings to keep the notebook clean:

In [20]:
import warnings
warnings.filterwarnings('ignore')

Expand the output in the notebook, so that the user doesn't need to scroll:

In [21]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

Code to align all the plots in this notebook in the center:

In [22]:
Center(""" <style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style> """)
Out[22]:

2. Data Preparation

2.1 Read US Accidents data

Read the US Accidents dataset using the CSV file found on the Kaggle website [1]. The CSV file was first loaded locally.

Note: If you need to run the code below, please download the CSV file from Kaggle, as it was too large to push to GitHub.

The file has 3 datetime variables, which are read with the corresponding datetime format.

In [23]:
accidents_df = pd.read_csv('US_Accidents_Dec20_updated.csv', parse_dates=['Start_Time', 'End_Time', 'Weather_Timestamp'])
accidents_df.head()
Out[23]:
ID Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Description ... Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-2716600 3 2016-02-08 00:37:08 2016-02-08 06:37:08 40.10891 -83.09286 40.11206 -83.03187 3.230 Between Sawmill Rd/Exit 20 and OH-315/Olentang... ... False False False False False False Night Night Night Night
1 A-2716601 2 2016-02-08 05:56:20 2016-02-08 11:56:20 39.86542 -84.06280 39.86501 -84.04873 0.747 At OH-4/OH-235/Exit 41 - Accident. ... False False False False False False Night Night Night Night
2 A-2716602 2 2016-02-08 06:15:39 2016-02-08 12:15:39 39.10266 -84.52468 39.10209 -84.52396 0.055 At I-71/US-50/Exit 1 - Accident. ... False False False False False False Night Night Night Day
3 A-2716603 2 2016-02-08 06:15:39 2016-02-08 12:15:39 39.10148 -84.52341 39.09841 -84.52241 0.219 At I-71/US-50/Exit 1 - Accident. ... False False False False False False Night Night Night Day
4 A-2716604 2 2016-02-08 06:51:45 2016-02-08 12:51:45 41.06213 -81.53784 41.06217 -81.53547 0.123 At Dart Ave/Exit 21 - Accident. ... False False False False False False Night Night Day Day

5 rows × 47 columns

2.2 Initial Data Exploration

For the purposes of data cleansing, some initial data exploration takes place. This exploration gives direction to the data cleansing needs.

Actions taken:

  • Data Exploration
    • Descriptive Statistics for numeric data
    • Frequency tables for categorical data
    • Statistics for missing values
    • Check for duplicated records
  • Data Manipulation:
    • From the Start_Time column extract the year, month and hour and create 3 respective columns
    • From the initial dataset select only accidents that took place in 2016, using the Start_Time column.
    • Rename some variables that in the original dataset had symbols that would complicate the code, like parentheses.
      For example, Temperature(F) becomes Temperature_F.
    • Drop 3 variables that have more than 69% missing values (Precipitation_in, Wind_Chill_F, Number)
    • Drop the Country column, which has only 1 distinct value populated in all records.
    • Impute the numeric variables separately for each city, using time interpolation in both directions, with the time as the index
    • Transform Temperature from Fahrenheit to Celsius.
    • Create dummy variables for the categorical variables that have a cardinality of 3 or less. In particular, dummy variables are created for the variables: 'Side', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'Timezone'

Get a view of all the columns and their types.

In [24]:
accidents_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1516064 entries, 0 to 1516063
Data columns (total 47 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   ID                     1516064 non-null  object        
 1   Severity               1516064 non-null  int64         
 2   Start_Time             1516064 non-null  datetime64[ns]
 3   End_Time               1516064 non-null  datetime64[ns]
 4   Start_Lat              1516064 non-null  float64       
 5   Start_Lng              1516064 non-null  float64       
 6   End_Lat                1516064 non-null  float64       
 7   End_Lng                1516064 non-null  float64       
 8   Distance(mi)           1516064 non-null  float64       
 9   Description            1516064 non-null  object        
 10  Number                 469969 non-null   float64       
 11  Street                 1516064 non-null  object        
 12  Side                   1516064 non-null  object        
 13  City                   1515981 non-null  object        
 14  County                 1516064 non-null  object        
 15  State                  1516064 non-null  object        
 16  Zipcode                1515129 non-null  object        
 17  Country                1516064 non-null  object        
 18  Timezone               1513762 non-null  object        
 19  Airport_Code           1511816 non-null  object        
 20  Weather_Timestamp      1485800 non-null  datetime64[ns]
 21  Temperature(F)         1473031 non-null  float64       
 22  Wind_Chill(F)          1066748 non-null  float64       
 23  Humidity(%)            1470555 non-null  float64       
 24  Pressure(in)           1479790 non-null  float64       
 25  Visibility(mi)         1471853 non-null  float64       
 26  Wind_Direction         1474206 non-null  object        
 27  Wind_Speed(mph)        1387202 non-null  float64       
 28  Precipitation(in)      1005515 non-null  float64       
 29  Weather_Condition      1472057 non-null  object        
 30  Amenity                1516064 non-null  bool          
 31  Bump                   1516064 non-null  bool          
 32  Crossing               1516064 non-null  bool          
 33  Give_Way               1516064 non-null  bool          
 34  Junction               1516064 non-null  bool          
 35  No_Exit                1516064 non-null  bool          
 36  Railway                1516064 non-null  bool          
 37  Roundabout             1516064 non-null  bool          
 38  Station                1516064 non-null  bool          
 39  Stop                   1516064 non-null  bool          
 40  Traffic_Calming        1516064 non-null  bool          
 41  Traffic_Signal         1516064 non-null  bool          
 42  Turning_Loop           1516064 non-null  bool          
 43  Sunrise_Sunset         1515981 non-null  object        
 44  Civil_Twilight         1515981 non-null  object        
 45  Nautical_Twilight      1515981 non-null  object        
 46  Astronomical_Twilight  1515981 non-null  object        
dtypes: bool(13), datetime64[ns](3), float64(13), int64(1), object(17)
memory usage: 412.1+ MB

Initial Data Transformation

From the Start_Time column extract the year, month and hour and create 3 respective columns to use them as additional features.

In [25]:
accidents_df['year'] = pd.DatetimeIndex(accidents_df['Start_Time']).year
accidents_df['month'] = pd.DatetimeIndex(accidents_df['Start_Time']).month
accidents_df['hour'] = pd.DatetimeIndex(accidents_df['Start_Time']).hour

Filter the dataframe to contain only accidents that happened in February or March of 2016.

In [26]:
accidents_2016_df = accidents_df[ (accidents_df['year']==2016) & (accidents_df['month'].isin([2,3]))]
print('The new dataframe has ' + str(accidents_2016_df.shape[0]) + ' rows.')
The new dataframe has 3018 rows.

Rename columns that in the original dataset have symbols that would complicate the code, like parentheses.

In [27]:
accidents_2016_df = accidents_2016_df.rename({'Temperature(F)': 'Temperature_F', 
                                              'Visibility(mi)': 'Visibility_mi', 
                                              'Distance(mi)': 'Distance_mi',
                                              'Wind_Chill(F)': 'Wind_Chill_F',
                                              'Humidity(%)': 'Humidity_prc',
                                              'Pressure(in)': 'Pressure_in',
                                              'Wind_Speed(mph)': 'Wind_Speed_mph',
                                              'Precipitation(in)': 'Precipitation_in'
                                             }, 
                                             axis='columns')

Web scraping to get the full names of the states from their abbreviations

In [28]:
accidents_2016_df = accidents_2016_df.rename(columns={'State':'State abbreviation'})

# Change abbreviations to the full names of states by web scraping. This is faster than manually inputting the names and abbreviations
res = requests.get("https://abbreviations.yourdictionary.com/articles/state-abbrev.html")
soup = BeautifulSoup(res.content, "html.parser")
states = soup.find('table')

df1 = pd.read_html(str(states))
df1 = pd.DataFrame(df1[0])

df1 = df1.iloc[1:,:2]
df1 = df1.rename(columns={0:'State', 1:'State abbrev'})

accidents_2016_df = accidents_2016_df.merge(df1, how='left', left_on = 'State abbreviation', right_on='State abbrev')

accidents_2016_df.drop(columns = 'State abbrev', inplace = True)

Duplicates, Descriptive Statistics for Numeric Data

Check whether there are any duplicates. The result below is zero, hence there are no duplicates in the dataset.

In [29]:
accidents_2016_df.duplicated().sum()
Out[29]:
0

Description of categorical data

Get the number of distinct values for each character column, in order to decide on the way they will be treated, based on whether their cardinality is low or high.

In [30]:
#Get a list of the character columns
character_cols = accidents_2016_df.select_dtypes(include=['object']).columns

#Get the number of distinct values for each character column
dic = {}
for c in character_cols:
    dic[c] = accidents_2016_df[c].nunique() 

#Sort them by ascending order
dic = dict(sorted(dic.items(), key=lambda item: item[1]))

#Print the dictionary nicely
print(yaml.dump(dic, sort_keys=False, default_flow_style=False))
Country: 1
Side: 2
Sunrise_Sunset: 2
Civil_Twilight: 2
Nautical_Twilight: 2
Astronomical_Twilight: 2
Timezone: 3
State abbreviation: 20
State: 20
Wind_Direction: 23
Weather_Condition: 23
County: 216
Airport_Code: 263
City: 813
Street: 992
Zipcode: 1521
Description: 2452
ID: 3018

Display frequency bar plots for the categorical variables that were found to have a cardinality of 3 or less.

In [31]:
low_cardinality_cols = ['Side', 'Sunrise_Sunset','Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight','Timezone']

fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(15, 10), sharey=True)
for i, c in enumerate(low_cardinality_cols):  
    low_card_df = accidents_2016_df.groupby(c)[c].agg('count')
    
    if i <3:
        low_card_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4, ax=axes[0,i], fontsize =12)
        axes[0,i].set_xlabel(c, fontsize =15);
    else:
        low_card_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4, ax=axes[1,i-3], fontsize =12)
        axes[1,i-3].set_xlabel(c, fontsize =15);
fig.subplots_adjust(hspace = 0.5); #adjust the plots so that the first row has some space from the second row

plt.title('Frequency bar plots',fontsize=24, y =2.9, x = -0.75);
plt.suptitle('(for categorical columns with low cardinality)', y=0.98, fontsize=15);

#Add separate plot for the country
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(4.5,4.5))
low_card_df = accidents_2016_df.groupby('Country')['Country'].agg('count')
low_card_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4,  fontsize =12);
axes.set_xlabel('Country', fontsize =15);

Findings:

From these results we see there are 5 variables (Side, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight) with a cardinality of 2. We will transform these into dummy variables so that they become numeric. Transforming a categorical variable with 2 levels into a dummy variable doesn't increase the complexity of the dataset, and hence of the models that will run on it, as only 1 dummy variable is created. Timezone also has a low cardinality (3), so it will be converted to 3 dummy variables, as for interpretability reasons we do not want to use a reference group for this variable.

The Country column has the same value in all records (US), hence we will drop it, as it doesn't provide any information.
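As a minimal illustration of the dummy-variable treatment just described (toy values, not from the dataset): a 2-level variable keeps a single dummy with a reference group, while the 3-level Timezone keeps all 3 dummies.

```python
import pandas as pd

# Toy frame with made-up values, mimicking the two cases discussed above
toy = pd.DataFrame({'Side': ['R', 'L', 'R'],
                    'Timezone': ['US/Eastern', 'US/Central', 'US/Pacific']})

dummies = pd.get_dummies(toy, columns=['Side', 'Timezone'])

# Side_L and Side_R are perfectly (negatively) correlated, so one of the two
# is dropped and acts as the reference group; Timezone keeps all 3 dummies.
dummies = dummies.drop('Side_L', axis=1)

print(list(dummies.columns))
# ['Side_R', 'Timezone_US/Central', 'Timezone_US/Eastern', 'Timezone_US/Pacific']
```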

Missing values

Plot the percentage of missing values for each column that has missing values.

In [32]:
# Create a function that counts the number of missing values and the percentage, stores them in a dataframe 
# and sorts it in descending order so that the columns with the most missing values appear first

def calculate_missing_values(input_df):
    #Count the number of missing values for each column and store the results in a named pandas Series
    missings_count = pd.Series(input_df.apply(lambda x: x.isnull().sum()), name ='missings_count')

    #Compute the percentage of missing values for each column and store the results in a named pandas Series
    missings_perc = pd.Series(input_df.isnull().sum() * 100 / len(input_df), name ='missings_perc')

    column_type = pd.Series(input_df.dtypes, name ='column_type')

    #Concatenate the previous pandas Series into a dataframe
    missings_df = pd.concat([column_type, missings_count, missings_perc], axis =1)

    #Sort the dataframe in descending order
    missings_df = missings_df.sort_values(by='missings_count', ascending =False)
    return missings_df

missings_df= calculate_missing_values(accidents_2016_df)



#Plot percentage of missing values for each column
missings_perc_ser = pd.Series(missings_df['missings_perc'], name ='missings_perc')

missings_df['missings_perc'][missings_perc_ser != 0].plot(kind= 'barh', 
                                                          xticks = [int(x) for x in np.linspace(0,100,11)],
                                                          title = 'Percentage of missing values', 
                                                          figsize = (9,6));

Observations:
From the bar plot above it can be seen that 3 variables (Wind_Chill_F, Number and Precipitation_in) have more than 69% missing values. We consider this a high percentage, hence we will drop these variables in the Data Cleansing part.

It can also be seen that 5 weather-related variables (Visibility_mi, Humidity_prc, Temperature_F, Pressure_in and Wind_Speed_mph) have about 2% missing values each, except Wind_Speed_mph with about 12%. For these variables we will perform imputation using the time interpolation method, after grouping by city, so that weather values are only interpolated based on the weather of the same city.
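As a minimal illustration of how the planned time interpolation fills a gap (toy values, not from the dataset): the fill is weighted by the actual time distance to the known neighbours, which is why the datetime column must serve as the index.

```python
import numpy as np
import pandas as pd

# A toy temperature series for one city, with one missing reading
times = pd.to_datetime(['2016-02-08 00:00', '2016-02-08 01:00', '2016-02-08 04:00'])
temps = pd.Series([40.0, np.nan, 46.0], index=times)

# 'time' interpolation weights the fill by the elapsed time between neighbours;
# limit_direction='both' also fills leading/trailing gaps where possible
filled = temps.interpolate(method='time', limit_direction='both')
print(filled.iloc[1])  # 41.5: the gap sits a quarter of the way from 40 to 46
```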

2.3 Data Cleansing

Imputation

Drop the 3 variables that have more than 69% missing values, and the Country column, which has only 1 distinct value populated in all records.

In [33]:
accidents_clean_temp_df = accidents_2016_df.drop(['Precipitation_in', 'Wind_Chill_F', 'Number', 'Country'], axis = 1)

Impute the missing records of the weather-related variables using time interpolation in both directions, grouping the records by city, so that interpolation happens only within the same city. The reason for this is that the weather is likely to be similar within a few days in the same city.

In [34]:
#Get all the city names that appear in the dataset, into one array
cities = accidents_2016_df['City'].unique()
num_cols_for_imputation = ['Wind_Speed_mph', 'Visibility_mi', 'Humidity_prc', 'Temperature_F', 'Pressure_in']

#Keep only the columns needed for the imputation; this dataframe is updated iteratively below
initial_df = accidents_2016_df[['ID', 'Start_Time', 'City'] + num_cols_for_imputation].copy()

#Loop through all the city names:
for c in cities:
    #Create a dataframe that contains only the records of one city,
    #which will be used to interpolate the missing values of that city
    imputed_city_df = initial_df[initial_df['City'] == c].copy()

    #Keep the current index in a column named 'The_org_index', because the index
    #is changed for the interpolation and then reverted back to the original one
    imputed_city_df['The_org_index'] = imputed_city_df.index

    #Set the index of the dataframe to 'Start_Time', because the time interpolation uses the index
    imputed_city_df = imputed_city_df.set_index(['Start_Time'])

    #Interpolate the missing values in both directions
    imputed_city_df = imputed_city_df.interpolate(method='time', limit_direction ='both')

    #Revert to the original index
    imputed_city_df = imputed_city_df.set_index(['The_org_index'])

    #Write the interpolated values back, matching the records on the original index
    initial_df.loc[imputed_city_df.index, num_cols_for_imputation] = imputed_city_df[num_cols_for_imputation]

num_imputation_df = initial_df.set_index(['ID'])
num_imputation_df
Out[34]:
Start_Time City Wind_Speed_mph Visibility_mi Humidity_prc Temperature_F Pressure_in
ID
A-2716600 2016-02-08 00:37:08 Dublin 10.4 10.0 58.0 42.1 29.76
A-2716601 2016-02-08 05:56:20 Dayton 2.3 10.0 91.0 36.9 29.68
A-2716602 2016-02-08 06:15:39 Cincinnati 10.4 10.0 97.0 36.0 29.70
A-2716603 2016-02-08 06:15:39 Cincinnati 10.4 10.0 97.0 36.0 29.70
A-2716604 2016-02-08 06:51:45 Akron 9.2 10.0 55.0 39.0 29.65
... ... ... ... ... ... ... ...
A-2825985 2016-03-31 18:54:45 Crestwood 4.6 10.0 78.0 66.0 29.55
A-2825986 2016-03-31 19:23:49 Toledo 19.6 5.0 84.0 63.0 29.42
A-2825987 2016-03-31 19:35:23 Columbus 9.2 10.0 72.0 61.0 29.55
A-2825988 2016-03-31 19:55:22 Columbus 9.2 10.0 72.0 61.0 29.55
A-2825989 2016-03-31 20:20:50 Columbus 9.2 10.0 72.0 61.0 29.55

3018 rows × 7 columns

Combine the dataset containing the imputed numerical columns with the rest of the columns.

In [35]:
num_imputation_df2= num_imputation_df.drop(['Start_Time','City'], axis = 1)
imputed_cols = num_imputation_df2.columns

accidents_clean_temp_df2= accidents_clean_temp_df.set_index('ID')
accidents_clean_temp_df2 = accidents_clean_temp_df2.drop(imputed_cols, axis = 1)

accidents_clean_temp_df3 = pd.concat([num_imputation_df2, accidents_clean_temp_df2], join = 'inner', axis = 1)

Check whether there are any missing values left by plotting their percentages:

In [36]:
missings_df2= calculate_missing_values(accidents_clean_temp_df3)

#Plot percentage of missing values for each column
missings_perc_ser = pd.Series(missings_df2['missings_perc'], name ='missings_perc')

missings_df2['missings_perc'][missings_perc_ser != 0].plot(kind= 'barh', 
                                                          xticks = [int(x) for x in np.linspace(0,100,11)],
                                                          title = 'Percentage of missing values after imputation', 
                                                          figsize = (9,6));

We can see that even though the missing percentages have decreased, there are still some missing values in the dataset.

It is observed that 7 weather-related variables have a similar missing percentage of about 1%. This raises the question of whether these missing values appear in the same records. If so, we will drop these records, since most of the weather-related variables are missing. To figure this out, in the next step we visualise the distribution of the missing records for these 7 variables.

In [37]:
#Get the names of the columns whose missing values percentage is between 0.7% and 5%:
missing_cols = list(missings_df2[(missings_perc_ser > 0.7) & (missings_perc_ser < 5)].index)

#Keep only the 7 variables that have around 1% of missing values and only the rows with at least 1 missing value. 
#This will facilitate the visualisation and help us focus on the records of interest.
missing_data = accidents_clean_temp_df3[missing_cols]
missing_data = missing_data[missing_data.isnull().any(axis=1)]

#Visualize the missing values from each record for the 7 variables that have around 1% of missing values.
#Before doing that, to facilitate the visualization, we sort the dataframe, so that the NaNs appear all together
missing_data = missing_data.sort_values(by=missing_cols, ascending =False)
msno.matrix(missing_data , labels = True);

Observations:

From the visualisation above it can be seen that 151 records remain with missing values in the 7 variables. A quarter of those missing values appear in the same records, except for Wind_Speed_mph, which has more missing values. This can be seen from the bottom part of the visualisation. Because of that, our approach will be to drop the records where all 7 of these variables are missing.

The remaining records with missing values (the ones at the top of the visualisation) are very few and were not imputed with the interpolation method above, because the city they belong to had no other weather values on previous or later days. For these reasons, we decide to also drop these remaining records.

Drop remaining rows with missing values and print the new number of records in the dataset.

In [38]:
accidents_clean_temp_df4= accidents_clean_temp_df3.dropna()
print('The remaining number of records in the dataset is ' + str(accidents_clean_temp_df4.shape[0]))
The remaining number of records in the dataset is 2867

Transform Fahrenheit to Celsius for the Temperature variable in the dataset, so that it is more meaningful to us. Therefore, instead of having the Temperature_F variable, we will have the Temperature_C variable.

For the transformation, the following equation is used.

\begin{eqnarray*} Celsius&=&\frac{5}{9}(Fahrenheit-32) \end{eqnarray*}
In [39]:
accidents_clean_temp_df4['Temperature_C'] = ((accidents_clean_temp_df4['Temperature_F'] -32) * (5/9)).round(decimals =1)
accidents_clean_temp_df5 = accidents_clean_temp_df4.drop('Temperature_F', axis =1)

Create dummy variables for the categorical variables that have a cardinality of 3 or less, as decided in the 'Exploration' part of the project.

For example, for the variable Side, the dummy variable Side_R is created, which is boolean and takes the value 1 when the accident happened on the right side of the road.

For the variable Sunrise_Sunset, the new dummy variable Sunrise_Sunset_Day is created, which is boolean; the value 1 represents accidents that happened during the day and 0 the ones that happened during the night.

The rest of the variables are treated in a similar way.

In [40]:
dummies_df = pd.get_dummies(accidents_clean_temp_df5, columns=['Side',  'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight' , 'Timezone'], prefix_sep='_')

#For the categorical variables that had only 2 levels, drop one of the 2 dummy variables, in order to use it as a reference group. 
#This is because the 2 new dummy variables would be completely correlated, hence there would be an unnecessary complexity. 
dummies_df = dummies_df.drop(['Side_L','Sunrise_Sunset_Night', 'Civil_Twilight_Night', 'Nautical_Twilight_Night', 'Astronomical_Twilight_Night'], axis =1)

#Rename some of the created dummy variables, so that they do not contain symbols that would need special treatment.
accidents_clean_df = dummies_df.rename({'Timezone_US/Central': 'Timezone_Central', 
                                              'Timezone_US/Eastern': 'Timezone_Eastern', 
                                              'Timezone_US/Pacific': 'Timezone_Pacific',
                                             }, 
                                             axis='columns')

3. Data Exploration

3.1 Weather and population data

Web scraping

For this section, I utilised the BeautifulSoup package in order to scrape population data from Wikipedia. This is the population data of different states in America, which is useful for comparing the number of accidents in a state to the state population. The data came from the census of 2010. I decided to use this instead of an estimate, as I thought the data would be more reliable.

In [41]:
url = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population#cite_note-USCBGazetteer-4"
table_class= "wikitable sortable jquery-tablesorter"
response = requests.get(url)
print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')
pop_table = soup.find('table', {'class':"wikitable sortable"})
df = pd.read_html(str(pop_table))
df = pd.DataFrame(df[0])
#print(df.head())
#display(df.head())

wiki_data = df.rename(columns= {"2020rank":"2020_Rank", "State[c]":"State", "2020census":"2020_Population", "2010census":"2010_Population"})
wiki_data.head()
200
Out[41]:
2020_Rank City State 2020_Population 2010_Population Change 2020 land area 2020 land area.1 2020 population density 2020 population density.1 Location
0 1 New York[d] New York 8804190 8175133 +7.69% 300.5 sq mi 778.3 km2 29,298/sq mi 11,312/km2 .mw-parser-output .geo-default,.mw-parser-outp...
1 2 Los Angeles California 3898747 3792621 +2.80% 469.5 sq mi 1,216.0 km2 8,304/sq mi 3,206/km2 34°01′N 118°25′W / 34.01°N 118.41°W
2 3 Chicago Illinois 2746388 2695598 +1.88% 227.7 sq mi 589.7 km2 12,061/sq mi 4,657/km2 41°50′N 87°41′W / 41.83°N 87.68°W
3 4 Houston Texas 2304580 2099451 +9.77% 640.4 sq mi 1,658.6 km2 3,599/sq mi 1,390/km2 29°47′N 95°23′W / 29.78°N 95.39°W
4 5 Phoenix Arizona 1608139 1445632 +11.24% 518.0 sq mi 1,341.6 km2 3,105/sq mi 1,199/km2 33°34′N 112°05′W / 33.57°N 112.09°W

The Wikipedia data was scraped successfully, as evidenced by the HTTP 200 status code.

I load the data that was cleansed earlier, in order to perform further visualisations.

First, this section gives an overview of the US traffic accidents grouped by the weather conditions identified. Secondly, four weather-related variables are selected to check for correlations between them: severity, visibility (miles), temperature (Celsius) and wind speed (miles per hour).

The correlation pair plot reveals a lack of strong correlation between the variables. Individually, each variable shows a pattern. For severity, accidents with a low severity level outnumber those with a higher one (more 2s than 3s or 4s). For visibility, most accidents happened under good (far) visibility rather than poor visibility. For temperature (C), the values vary broadly, with no single temperature clearly dominating. For wind speed, accidents at low to moderate wind speeds outnumber those at high wind speeds.

In [42]:
df = accidents_clean_df.copy()
df.reset_index( inplace = True)
In [43]:
sns.reset_orig() # don't change the style of plots

sev_df = df.groupby('Severity')['Severity'].agg('count')

event_df = df.groupby('Weather_Condition')['Weather_Condition'].agg('count')

fig, axes = plt.subplots(ncols=2, nrows=1, figsize=(15, 4))

sev_df.sort_values(ascending=False).head(20).plot(kind='bar', color='b', alpha=0.4, ax=axes[0])

axes[0].set_ylabel('Number of events')

event_df.sort_values().plot(kind='barh', color='b', alpha=0.4, ax=axes[1])

axes[1].set_xlabel('Number of events')
fig.subplots_adjust(wspace=0.8);

plt.show()

The first plot shows the number of accidents at each severity level in February and March 2016 (based on our clean dataset): most accidents were of severity '2', with similar numbers at severities '3' and '4'. The second plot shows the number of accidents under each recorded weather condition. Surprisingly, most accidents occurred under clear weather conditions; overcast and mostly cloudy came second and third, as hypothesised.

In [44]:
scatter = df[['Severity', 'Visibility_mi', 'Temperature_C', 'Wind_Speed_mph']]
sns.pairplot(scatter, kind='hist')
plt.show()

This pairplot shows the pairwise relationships between four variables: wind speed, temperature (C), visibility and severity. There is a lack of strong correlation between any pair of the variables.

Maps Data

Objective 2: Visualise the US accidents of February and March 2016 by severity on an interactive map, and examine how the severity of accidents relates to location.

The interactive map reveals that most accidents occurred on the west and east coasts. In particular, severe accidents tended to occur on the east coast, around the state of New York. However, some accidents also occurred more centrally.

In [45]:
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

m = folium.Map(location=[37.0902, -95.7129], zoom_start =4)

# Display the map
m
Out[45]:
Make this Notebook Trusted to load map: File -> Trust Notebook

In this first section, I load an interactive, zoomable map showing the whole of the USA. This lets users explore the map before the accidents data is plotted.

In [46]:
fig = px.scatter_geo(df,lat='Start_Lat',lon='Start_Lng', hover_name="ID",height=1000)
fig.show()

In this initial map, I plotted the accidents by longitude and latitude, using the accident ID as the hover name. However, I improve upon this map with a different method from a different library, as this one does not represent the data clearly enough.

In [47]:
fig1 = go.Figure(data=go.Scattergeo(
        lon = df['Start_Lng'],
        lat = df['Start_Lat'],
        text = df['Description'],
        mode = 'markers',
        marker = dict(
        size = 5,
        opacity = 0.8, 
        reversescale = True,
        autocolorscale = False,
        symbol= 'circle',
        line = dict(
        width=1,
        color= 'rgba(102,102,102)'),
        colorscale= 'blackbody',
        cmin =0,
        color = df['Severity'],
        cmax = df['Severity'].max(),
        colorbar_title= "Severity of Accidents")
        ))

fig1.update_layout(
        title = 'Most severe US Car Accidents<br>(Hover for description)',
        geo_scope='usa',
    height=1000
    )
fig1.show()

In this interactive map, I improved the visualisation by adding a colour bar and giving each incident severity a different colour. This revealed that the most serious incidents (red and black) occurred in the North East and Mid East regions, whereas the less severe accidents of severity '2' (yellow) took place on the West and East Coasts.

Further Visualisation

In [48]:
df = df.set_index(pd.DatetimeIndex(df['Start_Time']))
monthly = df.resample('M').count()
yearly = df.resample('Y').count()
In [49]:
res_2016 = df[~(df['Start_Time'] > '2016-12-31')]
res_2017 = df[~(df['Start_Time'] < '2017-01-01')]
In [50]:
monthly_data = monthly[['ID']]
monthly_data.index.date
#monthly_data['Start_Time'] = pd.to_datetime(monthly_data['Start_Time']).dt.date
y = monthly_data['ID']
mylabels = monthly_data.index.date

plt.pie(y, labels = mylabels)
plt.show() 

This pie chart shows the number of accidents by month: more accidents occurred in March 2016 than in February 2016.

Objective 3: Determine when accidents occurred and under which weather conditions, in order to see if there is a correlation.

From the pie chart representations, it is clear that more accidents occurred in March 2016 than in February 2016. The most common weather condition at the time of an accident was clear, with a count of 1,062; the second most common was overcast, at 553, and the third mostly cloudy, at 378.

We use an interactive method to re-visualise the data.

In [51]:
plt.style.use('seaborn')
weather_condition = res_2016.groupby(["Weather_Condition"])["ID"].count().reset_index(name='Count')

fig = px.pie(weather_condition, values='Count', names='Weather_Condition',
             title='Accidents by Weather Condition (Feburary and March 2016)',
             hover_data=['Count'], labels={'Count':'Count'}, width= 800, height= 800)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

This plot depicts the data much more clearly and adds interactivity, whereby the user can hover to see the count of accidents in each category. It also has a useful legend, and conditions can be removed from the chart by clicking on them in the legend.

In [52]:
weather_condition.loc[weather_condition['Count'] < 500, 'Weather_Condition'] = 'Other conditions'
fig = px.pie(weather_condition, values='Count', names='Weather_Condition', title='Accidents by Weather Condition (February and March 2016)', width= 800, height = 800)
fig.show()

To break the accidents down further, I visualised the three most common groups (Clear, Overcast and Other conditions) to see which was most common and by what percentage.

Objective 4: Correlation between states and accidents.

There were more accidents in the states of California, Ohio and New York than in any other state in February and March 2016. This may be because these states are popular tourist destinations and densely populated, so there is greater potential for accidents to occur.

In [53]:
state_count = res_2016.groupby(["State"])["ID"].count().reset_index(name='Count')
In [54]:
fig = px.pie(state_count, values='Count', names='State',
             title='Accidents by State  (February and March 2016)',
             hover_data=['Count'], labels={'Count':'Count'}, width = 800, height= 800)
fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

Similarly, I used this visualisation to see how many accidents occurred in each state in February and March 2016. It shows the percentage, the count and the state name.

In [55]:
plt.hist(res_2016["Visibility_mi"], bins=50)
plt.title("Histogram for Visibility(mi) of Car Accidents, 2016")
plt.xlabel("Visibility")
plt.ylabel("Frequency")
plt.show()

This histogram shows the distribution of visibility (miles) at the time of the 2016 car accidents.

In [57]:
sns.barplot(x="Severity",y="Temperature_C", data=res_2016)
plt.title("Barplot")
Out[57]:
Text(0.5, 1.0, 'Barplot')

This barplot shows the mean temperature (C) of accidents at each severity level.

In [58]:
sns.distplot(res_2016["Wind_Speed_mph"],bins=100,kde=True);

This distplot shows the distribution of wind speed (mph) across accidents.

Analysing Wiki Data

In [59]:
wiki_data.head()
Out[59]:
2020_Rank City State 2020_Population 2010_Population Change 2020 land area 2020 land area.1 2020 population density 2020 population density.1 Location
0 1 New York[d] New York 8804190 8175133 +7.69% 300.5 sq mi 778.3 km2 29,298/sq mi 11,312/km2 .mw-parser-output .geo-default,.mw-parser-outp...
1 2 Los Angeles California 3898747 3792621 +2.80% 469.5 sq mi 1,216.0 km2 8,304/sq mi 3,206/km2 34°01′N 118°25′W / 34.01°N 118.41°W
2 3 Chicago Illinois 2746388 2695598 +1.88% 227.7 sq mi 589.7 km2 12,061/sq mi 4,657/km2 41°50′N 87°41′W / 41.83°N 87.68°W
3 4 Houston Texas 2304580 2099451 +9.77% 640.4 sq mi 1,658.6 km2 3,599/sq mi 1,390/km2 29°47′N 95°23′W / 29.78°N 95.39°W
4 5 Phoenix Arizona 1608139 1445632 +11.24% 518.0 sq mi 1,341.6 km2 3,105/sq mi 1,199/km2 33°34′N 112°05′W / 33.57°N 112.09°W

Objective 5: Determine whether there is a correlation between population of a state and the number of accidents.

The linear regression revealed that the number of accidents increases with population. The R-squared value of 0.824 indicates a fairly strong relationship. This approach follows the Linear Regression Analysis book by Seber [7].
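For a simple linear regression like the one fitted below with scipy.stats.linregress, the reported R-squared is simply the square of the Pearson correlation coefficient between the two variables. A toy sketch with made-up numbers (not the real state data) illustrating this equivalence:

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: population (millions) vs accident counts
population = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
accidents = np.array([12, 19, 33, 38, 52])

res = stats.linregress(population, accidents)
r, _ = stats.pearsonr(population, accidents)

# For simple linear regression, R^2 equals the squared Pearson correlation
print(round(res.rvalue**2, 3), round(r**2, 3))
```

This is why a single R-squared value is enough to summarise the strength of the linear relationship in the two-variable case.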

In [60]:
#print(statesdf)
#print(state_count)

merged_state=state_count.merge(df,on="State")

wholedataset_wiki= wiki_data.merge(merged_state, on="State")
wholedataset_wiki= wholedataset_wiki.rename(columns={"Count": "Accidents"})
wholedataset_wiki

wholedataset =wholedataset_wiki.groupby(["State"]).sum()
wholedataset['%'] = (wholedataset['Accidents']/wholedataset['2020_Population'])*100
wholedataset['State'] = wholedataset.index
wholedataset

wholedataset.plot(x='State', y=["Accidents", "2020_Population"], kind="bar")

ax = plt.subplot()
ax.bar(wholedataset["State"], wholedataset["2020_Population"])
ax.bar(wholedataset["State"], wholedataset["Accidents"], color="maroon")
degrees = 90
plt.xticks(rotation=degrees)
plt.ticklabel_format(style='plain', axis='y')


plt.show()

x= wholedataset["2020_Population"].astype(int)
x.describe()


# Create figure and plot space

plt.style.use('seaborn')

fig, ax = plt.subplots(figsize=(10, 10))

# Add x-axis and y-axis
ax.scatter(wholedataset['2020_Population'],
           wholedataset['Accidents'],
           color='purple')

# Set title and labels for axes
ax.set(xlabel="Population",
       ylabel="Accidents",
       title="Accidents by state - US\n 2016")

plt.tight_layout()
#plt.plot(monthly_data.index.values, monthly_data['ID'], 'xb-')
plt.show()

from scipy import stats
x= wholedataset["2020_Population"].values
y=  wholedataset.Accidents.values

res = stats.linregress(x,y)
print(f"R-squared: {res.rvalue**2:.6f}")
plt.figure(figsize=(10,8))
plt.plot(x,y, "o", label="original data")
plt.plot(x, res.intercept + res.slope*x, "r", label= "fitted line")
plt.legend()
plt.xlabel("Population")
plt.ylabel("Accidents")
plt.xticks(rotation=degrees)
plt.ticklabel_format(style='plain')
plt.show()



res_2016.describe()
import matplotlib.pyplot as plt
from scipy import stats
y= res_2016["Temperature_C"].values
x=  res_2016["Visibility_mi"].values

res = stats.linregress(x,y)
print(f"R-squared: {res.rvalue**2:.6f}")
plt.figure(figsize=(10,8))
plt.plot(x,y, "o", label="original data")
plt.plot(x, res.intercept + res.slope*x, "r", label= "fitted line")
plt.legend()
plt.xlabel("Visibility")
plt.ylabel("Temperature")
plt.xticks(rotation=degrees)
plt.ticklabel_format(style='plain')
plt.show()
R-squared: 0.983319
R-squared: 0.133311

Here, I clean and merge the two datasets on the relevant columns so that I can visualise them and check whether there is a correlation between state population and the number of accidents. I do this using the linear regression tool in scipy.stats. The regression showed that the number of accidents increases with population: the R-squared value of 0.824 indicates a fairly strong positive relationship between the number of accidents and state populations, as hypothesised.

Heatmap

In [61]:
feat_keep = ['Severity', 'Temperature_C',  'Humidity_prc', 'Pressure_in', 'Visibility_mi', 
             'Wind_Direction', 'Wind_Speed_mph', 'Weather_Condition']
features= df[feat_keep]

corr = features.corr()

fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(np.absolute(corr), annot=True)
plt.title('Correlation Heatmap', fontsize=20)
plt.show()

Lastly, a correlation heatmap reveals the relationships between six factors: Severity, Temperature, Humidity, Pressure, Visibility and Wind Speed. It shows a high correlation between Temperature and Visibility, as well as between Pressure and Visibility. The remaining pairs are much more weakly correlated. Note that absolute values are plotted, so the sign of each correlation is not shown.
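Because np.absolute is applied before plotting, the heatmap cannot distinguish positive from negative relationships; inspecting the signed matrix directly recovers that information. A minimal sketch with toy data (not the accident data):

```python
import pandas as pd

# Toy frame where 'b' decreases exactly as 'a' increases
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]})

signed = toy.corr()       # keeps the sign of each correlation
magnitude = signed.abs()  # what the absolute-value heatmap displays

print(signed.loc['a', 'b'], magnitude.loc['a', 'b'])  # -1.0 1.0
```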

3.2 Datetime data

Accidents during the week

We were interested in the distribution of accidents across the week, so we used seaborn to generate a countplot of the accidents. Below we see that the majority of accidents occur on Thursday, Wednesday and Tuesday, while the weekend has the fewest. This is to be expected, since most people commute for work during the week, especially on Tuesday, Wednesday and Thursday. The violin plot on the left below supports this, as the kernel density appears higher during peak hours (6-10am and 3-7pm) and lower during inter-peak hours (10am-3pm) and off-peak hours (7pm-6am).

Whilst on most days the majority of accidents have a severity level of 2, during the weekend most accidents have a level of 4. This could be due to people driving while impaired by alcohol, which is much more likely to lead to a severe accident. The violin plot on the right supports this, with the kernel density being higher in the evening, when people are more likely to have drunk alcohol.

In [128]:
df = accidents_clean_df.copy()
df.reset_index( inplace = True)
df['Day'] = df['Start_Time'].dt.day_name()
df['Hour'] = df['Start_Time'].dt.hour
In [129]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
sns.countplot(ax=axes[0],x='Day', data = df).set_title('Number of accidents per weekday');
sns.countplot(ax=axes[1],x='Day', data = df, hue = 'Severity').set_title('Number of accidents per weekday');
In [130]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
sns.violinplot(ax=axes[0], data = df,  x = 'Severity', y = 'Hour').set_title('Plot of hour by Severity');
fri_df = df[df['Day']=='Friday']
sns.violinplot(ax=axes[1], data = fri_df,   y = 'Hour').set_title('Plot of hour on Friday');

3.3 Location data

To get an idea of the distribution of accidents across different road features, a count plot was generated, shown below. Junctions have the highest number of accidents, at around 250, whilst roundabouts and turning loops have none. This suggests that more work may be needed to make junctions safer; for example, transport authorities could look at updating road markings. Traffic signals have the second highest count. Again, this may warrant investigation: for example, is the gap between the green times of different traffic signals long enough?

This is the total number of accidents for the US. It would probably be more useful to look at specific states or cities, as different states/cities have different characteristics.

In [132]:
# Plot the number of accidents per location
fig, ax = plt.subplots(figsize = (15,6))
df[['Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']].sum().plot(kind='barh', ax=ax,
                                                                               title = 'Number of accidents per location')
ax.set_xlabel('Number of accidents');
Out[132]:
Text(0.5, 0, 'Number of accidents')

Knowing the specific states and cities with the highest number of accidents could help us narrow down the main causes. Below, count plots of accidents per state and city have been generated. Here, we see that California has the highest number of accidents, followed by Ohio. According to The Barnes Firm, states with a higher population have more cars on the road and are therefore more likely to have crashes [4].

In [133]:
# View count of accidents per city and state
fig, ax = plt.subplots(figsize = (15,6))
city_by_accident=df.City.value_counts()
city_by_accident[:30].plot(kind='barh', ax=ax, title = "30 highest counts of accidents per City");
In [134]:
fig, ax = plt.subplots(figsize = (15,6))
state_by_accident=df.State.value_counts()
state_by_accident.plot(kind='barh', ax=ax, title = "Counts of accidents per State")
Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0x23cb6ccb9a0>

Since California has the highest number of accidents, we decided to view its distribution. As expected, it is similar to the overall distribution; this could be because California makes up a large proportion of the data. Columbus has a similar distribution. This shows that, generally, junctions tend to be the location where most accidents occur.

To get a better view of the number of accidents per state, a heatmap was created, shown below. Here, we can see more easily that other states share this behaviour, with the highest number of accidents occurring at junctions.

In [135]:
fig, axes = plt.subplots(1,2, figsize = (15,6))
Cali_df = df[df['State'] == 'California']
Cali_df[['Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']].sum().plot(ax = axes[0],kind='barh', 
                                                                               title = 'Number of accidents per location (California State)');
axes[0].set_xlabel('Number of accidents')

Cali_df = df[df['City'] == 'Columbus']
Cali_df[['Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']].sum().plot(ax = axes[1], kind='barh', 
                                                                               title = 'Number of accidents per location (Columbus City)');
axes[1].set_xlabel('Number of accidents');
In [136]:
state_loc_df =  df[['State', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']]

state_loc_piv = state_loc_df.pivot_table(index = 'State', values = ['Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop'], aggfunc='sum')

fig, axes = plt.subplots(figsize = (15,10))
sns.heatmap(state_loc_piv, cmap = plt.cm.RdPu);
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x23c943b9bb0>

3.4 Vehicle Ownership data

We are interested in whether vehicle ownership per state is related to the number of accidents. As mentioned above, The Barnes Firm states that more populous states have more vehicles on the road, so we expect to see a correlation, and for California to have the largest number of vehicles.

The latest dataset found for vehicles per capita was for 2017, and was web scraped. Given that the 2017 figures are similar to those of 2015, we assume the 2016 values are roughly the same. Also, note that the data records only light vehicles.

Interestingly, we find that California does not come out on top. In fact, Wyoming has the highest number of vehicles per capita. This could be because more populous states are more likely to have public transport systems, lessening the need for people to own their own vehicles. Wyoming, on the other hand, has one of the smallest populations, so public transport may be less accessible, creating a need for more private vehicle ownership.

Note that Wyoming is not in our main dataset, indicating just how few accidents it has, if any.

In [137]:
# Webscrape vehicle ownership per State in 2017 
url = "https://en.wikipedia.org/wiki/List_of_U.S._states_by_vehicles_per_capita"
table_class= "wikitable sortable jquery-tablesorter"
response = requests.get(url)
print(response.status_code)


res = requests.get("https://en.wikipedia.org/wiki/List_of_U.S._states_by_vehicles_per_capita")
soup = BeautifulSoup(res.content, "html.parser")
car_own = soup.find_all('table', {'class':"wikitable sortable mw-collapsible"})

df2 = pd.read_html(str(car_own[1]))
df2 = pd.DataFrame(df2[0])
200

Next, the vehicle count was merged with the main dataframe and compared to the accident count. Glancing at the table below, no clear correlation is visible. For example, Iowa has the highest vehicle ownership, at 1,050 per 1,000 people, with a relatively low number of accidents. On the other hand, California has an accident count of 960, even though its ownership is 840 per 1,000.

The plot below visualises vehicle ownership alongside accident count, making the distributions easier to compare.

A scatter plot was also created to check for a relationship between the two variables; again, there is no apparent linear relationship between vehicle ownership and the number of accidents per state. This is confirmed by the correlation calculated below, -0.07, which is close to zero. Therefore, we conclude that there is no meaningful linear correlation.

In [138]:
df3 = df.merge(df2, how = 'left', left_on='State', right_on= 'State')
veh_piv = df3.pivot_table(index = ['State', 'Vehicles per 1000 People'], values = 'ID', aggfunc = 'count')
veh_piv.rename(columns={'ID':'Accident count'}, inplace = True)
veh_piv.reset_index(inplace = True)
veh_piv.sort_values(by = 'Vehicles per 1000 People',ascending = False).head()
Out[138]:
State Vehicles per 1000 People Accident count
4 Iowa 1050 22
12 Nebraska 1046 20
2 Delaware 950 8
3 Indiana 914 142
16 Ohio 910 787
In [139]:
fig, ax = plt.subplots(figsize = (30,8))
plt.bar(veh_piv['State'], veh_piv['Vehicles per 1000 People'])
plt.bar(veh_piv['State'], veh_piv['Accident count'], bottom=veh_piv['Vehicles per 1000 People'])
plt.show()
In [191]:
veh_piv.reset_index(inplace =True)
fig, ax = plt.subplots(figsize= (30,20))
plt.scatter(veh_piv['Accident count'], veh_piv['Vehicles per 1000 People'])
states = veh_piv['State']


texts = []
for i, name in enumerate(states):
    texts.append(ax.text(veh_piv['Accident count'][i],veh_piv['Vehicles per 1000 People'][i],  name))
    
adjust_text(texts)
plt.rcParams.update({'font.size': 30})
ax.set_title("Vehicle ownership vs accident count",fontsize=30)
ax.set_xlabel("Vehicles per 1000 People",fontsize=20)
ax.set_ylabel("Accident count",fontsize=20)
plt.show() 
In [192]:
states_count = veh_piv[['State', 'Accident count']]
states_count.set_index('State', inplace = True)
veh_piv.set_index('State', inplace = True)
veh_piv.corr()
Out[192]:
Vehicles per 1000 People Accident count
Vehicles per 1000 People 1.000000 -0.073624
Accident count -0.073624 1.000000

3.5 Traffic Google Maps data

Get Traffic data from Google Maps

Scrape traffic data from Google Maps. The data is the time needed for a car to travel from the Start point of the accident to the End point of the accident, at the time of the accident and 30 mins before.

The code for extracting this data is commented out, because each user is allowed a limited number of free queries, so this code shouldn't be run twice with the same API key. The data was saved in a csv file, which is read below.

(Note: This traffic data was retrieved for about 1000 accidents and not for all the accidents we are analysing in this project, as the Google Maps API calls required credits after a specific number of requests.)

Information about the API used to retrieve Google Maps Directions can be found here: https://app.outscraper.com/api-docs
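A key step in the commented-out cell below is converting each accident's Start_Time string into the Unix timestamp passed as the departure_time of the directions query. A standalone sketch of that conversion, using an example start time in the dataset's format:

```python
from datetime import datetime

# Parse an example start time (same format as the dataset) into a datetime
dttm = datetime.strptime('2016-02-08 00:37:00', '%Y-%m-%d %H:%M:%S')

# Convert to a Unix timestamp (seconds since the epoch)
timestamp = int(datetime.timestamp(dttm))

# 30 minutes earlier, for the "before the accident" traffic query
timestamp_30bef = timestamp - 1800
print(timestamp - timestamp_30bef)  # 1800 seconds = 30 minutes
```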

In [148]:
# #Store the API key into a python object
# api_client = ApiClient(api_key='YXV0aDB8NjFiYjkyNGNiNmZkMzAwMDZiMDZiMDNlfDMzZDUyZTQ5ZDk')

# #Get a subset of rows of the main dataset and select only the columns needed for scraping google maps data
# accidents_sample_df = accidents_df.iloc[0:1200 , 2:8]
# accidents_sample_df

# #Scrape traffic data from google maps, regarding the time needed to go from the accident start point
# #to the end point, before and during the accident


# #Loop through each row of the dataframe
# for index, row in accidents_sample_df.iterrows():
    
#     #store the start and end coordinates in a format that google maps accepts
#     coords = np.array([ str(row['Start_Lat']) + ', ' +  str(row['Start_Lng']) + '    ' + str(row['End_Lat']) + ',' + str(row['End_Lng'])])
    
    
#     #---------At the accident time:
    
#     #Transform the value of the Start_Time column into datetime and then into timestamp
#     dttm = datetime.strptime(row['Start_Time'], '%Y-%m-%d %H:%M:%S') 
#     timestamp = int(datetime.timestamp(dttm))
    
#     try:
        
#         google_maps = api_client.google_maps_directions(coords, departure_time = timestamp)

#         gm_dict =google_maps[0] #get the dictionary from the list of 1 element
#         accidents_sample_df.loc[index, 'duration']= gm_dict['duration(minutes)']
#         accidents_sample_df.loc[index,'duration_max']= gm_dict['duration_max(minutes)']
#         accidents_sample_df.loc[index,'speed']= str(gm_dict['road_distance_timing(meters:seconds)'])    
#         print(index)
    
#     except: # takes care of the case when the API call will not be successful 
#         print (coords, dttm, ' did not work')
    
#     #----------30 mins before the accident time:
#     timestamp_30bef = timestamp - 1800
    
#     try:
#         google_maps = api_client.google_maps_directions(coords, departure_time = timestamp_30bef)

#         gm_dict =google_maps[0] #get the dictionary from the list of 1 element
#         accidents_sample_df.loc[index, 'duration_30bef']= gm_dict['duration(minutes)']
#         accidents_sample_df.loc[index,'duration_max_30bef']= gm_dict['duration_max(minutes)']
#         accidents_sample_df.loc[index,'speed_30bef']= str(gm_dict['road_distance_timing(meters:seconds)'])
#         print(index)
        
#     except:
#         print (coords, dttm, ' did not work')

# #save dataframe to csv file        
# accidents_sample_df.to_csv('gm_all.csv', index = True)
# gm_all_df
In [149]:
#Load google maps data that were generated before:
gm_all_df = pd.read_csv ('gm_all.csv')
gm_all_df.head()
Out[149]:
ID Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng duration duration_max speed duration_30bef duration_max_30bef speed_30bef
0 A-2716600 08/02/2016 00:37 08/02/2016 06:37 40.10891 -83.09286 40.11206 -83.03187 3.0 4.0 {'0': 0, '441': 15, '560': 19, '1170': 39, '22... 3.0 4.0 {'0': 0, '441': 14, '560': 19, '1170': 39, '14...
1 A-2716601 08/02/2016 05:56 08/02/2016 11:56 39.86542 -84.06280 39.86501 -84.04873 1.0 1.0 {'0': 0, '326': 10, '1161': 36, '1202': 37} 1.0 1.0 {'0': 0, '326': 10, '1161': 36, '1202': 38}
2 A-2716602 08/02/2016 06:15 08/02/2016 12:15 39.10266 -84.52468 39.10209 -84.52396 1.0 1.0 {'0': 0, '47': 2, '89': 3} 1.0 1.0 {'0': 0, '89': 3}
3 A-2716603 08/02/2016 06:15 08/02/2016 12:15 39.10148 -84.52341 39.09841 -84.52241 1.0 1.0 {'0': 0, '9': 0, '182': 7, '206': 8, '285': 11... 1.0 1.0 {'0': 0, '9': 0, '182': 7, '206': 8, '285': 11...
4 A-2716604 08/02/2016 06:51 08/02/2016 12:51 41.06213 -81.53784 41.06217 -81.53547 6.0 7.0 {'0': 0, '208': 7, '250': 8, '415': 14, '515':... 6.0 7.0 {'0': 0, '250': 8, '415': 14, '515': 17, '1045...

Data Cleansing and Data Engineering

The columns speed and speed_30bef were extracted from Google Maps and are in the format of a string-encoded dictionary. Below we convert them into meaningful measurements of speed. In particular, Google Maps returned, in dictionary format, the distances and the times needed to cover them at different location points between the start and end points of the accident. Using this information, we calculate the maximum speed that a car could travel within the accident area and create the columns speed_max and speed_30bef_max. The first refers to the maximum speed at the time of the accident and the second to the maximum speed 30 mins before it. The difference of these two variables is also calculated and stored in a new column called speed_max_diff.

The speed difference is calculated so that a positive value represents an increase in speed at the time of the accident.
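The core of this transformation can be illustrated on a single string-encoded timing dictionary (a shortened version of the first value in the table above, where keys are cumulative metres and values cumulative seconds):

```python
import json
import numpy as np

raw = "{'0': 0, '441': 15, '560': 19}"

timing = json.loads(raw.replace("'", '"'))            # json.loads requires double quotes
timing = {k: v for k, v in timing.items() if v != 0}  # drop zero-second entries to avoid division by zero

metres = np.array([int(k) for k in timing])
seconds = np.array(list(timing.values()))
speed_max = np.around(np.max(metres / seconds), 2)    # maximum speed in metres per second
print(speed_max)  # 29.47
```

The cell below applies this same logic row by row to both the speed and speed_30bef columns.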

In [150]:
#Transform the speed and speed_30bef columns, which are in the format of a string-encoded dictionary, into meaningful measurements of speed

#Iterate through the rows of the dataframe
for index, row in gm_all_df.iterrows():
    
    #If the API call to google maps was not successful and the result is Nan then skip this iteration
    #and continue to the next
    if pd.isnull(gm_all_df.loc[index, 'speed']):
        continue
    
    if pd.isnull(gm_all_df.loc[index, 'speed_30bef']):
        continue
    
    #Create a function that accepts a column name, which has the speed in dictionary format saved, 
    #extracts its values and creates an array with the speed in different distance points:  
    def get_speed (column_name):
        #Extract the value from the speed column of the particular row and save it in an object:
        speed1 = gm_all_df.loc[index, column_name]

        #The next function needs double quotes around keys and values, so we replace single quotes with double
        speed1_string = speed1.replace("'", "\"") 

        #Transform string into dictionary
        speed1_dict = json.loads(speed1_string) 

        #Remove the items from the dictionary with zero time. This will facilitate the numeric calculations (divide) later on
        for k, v in list(speed1_dict.items()):
            if v == 0:
                del speed1_dict[k]
        
        #Get the keys of the dictionary into a list. The keys represent the meters. Their format is string.
        speed1_meters_str= list(speed1_dict.keys())

        #Perform conversion of the string elements in the list into integers (meters) using list comprehension
        speed1_meters_num = [int(i) for i in speed1_meters_str]
        
        #Get the values of the dictionary into a list. 
        #The values represent the seconds needed for a car to cover the distance mentioned in the keys of the dictionary
        speed1_seconds_num = list(speed1_dict.values())

        #Transform the lists into numpy arrays in order to perform division of their elements and store them into a new numpy array.
        #Dividing meters/seconds returns the speed.
        a = np.array(speed1_meters_num)
        b = np.array(speed1_seconds_num)
        speed_array = np.divide(a,b)
        
        # Return the array with the speed in different distance points for the particular accident at the particular time
        return speed_array
 

    #Get speed array for the column 'speed':
    speed_at_accident = get_speed(column_name = 'speed')
    
    if len(speed_at_accident) == 0 :
        continue
    
    #Calculate maximum speed, round it to 2 decimal places and save it in the dataframe as new column
    gm_all_df.loc[index, 'speed_max'] = np.around(np.max(speed_at_accident), 2)

    #Get speed array for the column 'speed_30bef':
    speed_30bef_accident = get_speed(column_name = 'speed_30bef')
    
    if len(speed_30bef_accident) == 0 :
        continue
    
    #Calculate maximum speed, round it to 2 decimal places and save it in the dataframe as new column
    gm_all_df.loc[index, 'speed_30bef_max'] = np.around(np.max(speed_30bef_accident), 2)
    
    #Convert duration from minutes to seconds
    gm_all_df.loc[index, 'duration'] = gm_all_df.loc[index, 'duration']  * 60
    gm_all_df.loc[index, 'duration_max'] = gm_all_df.loc[index, 'duration_max']  * 60
    gm_all_df.loc[index, 'duration_30bef'] = gm_all_df.loc[index, 'duration_30bef']  * 60
    gm_all_df.loc[index, 'duration_max_30bef'] = gm_all_df.loc[index, 'duration_max_30bef']  * 60

gm_speed_df = gm_all_df.drop(['speed', 'speed_30bef', 'Start_Time', 'End_Time' ,  'Start_Lat' ,  'Start_Lng'  , 'End_Lat' , 'End_Lng'], axis = 1)    
gm_speed_df= gm_speed_df.set_index('ID')
gm_speed_df.head()
Out[150]:
duration duration_max duration_30bef duration_max_30bef speed_max speed_30bef_max
ID
A-2716600 180.0 240.0 180.0 240.0 30.31 31.50
A-2716601 60.0 60.0 60.0 60.0 32.60 32.60
A-2716602 60.0 60.0 60.0 60.0 29.67 29.67
A-2716603 60.0 60.0 60.0 60.0 26.00 26.00
A-2716604 360.0 420.0 360.0 420.0 31.25 31.25
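The dictionary-parsing logic inside the loop above can be sanity-checked on a toy speed string (the values below are made up for illustration):

```python
import json
import numpy as np

# Hypothetical speed string in the same format as the 'speed' column:
# keys are the meters covered, values are the seconds needed
speed_str = "{'120': 10, '60': 0, '300': 20}"

# json.loads needs double quotes, so replace the single quotes first
speed_dict = json.loads(speed_str.replace("'", '"'))

# Drop zero-second segments to avoid division by zero, then divide meters/seconds
speed_dict = {k: v for k, v in speed_dict.items() if v != 0}
speeds = np.array([int(k) for k in speed_dict]) / np.array(list(speed_dict.values()))
print(speeds)  # [12. 15.] -> speeds in meters per second
```

The maximum of this array (15.0 here) is what gets stored in the `speed_max` column.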
In [151]:
#Combine google maps and accidents datasets
accidents_gm_speed_df = pd.concat([accidents_clean_df, gm_speed_df,], join = 'inner', axis = 1)

#drop rows with missing values
accidents_gm_speed_df = accidents_gm_speed_df.dropna()

#Get the name of the day and the hour
accidents_gm_speed_df['Day'] = accidents_gm_speed_df['Start_Time'].dt.day_name()
accidents_gm_speed_df['Hour'] = accidents_gm_speed_df['Start_Time'].dt.hour

#List of different binary columns we want to use to split the data
bin_cols =[ 'Crossing' ,'Junction', 'Amenity', 'Stop',  'Traffic_Signal'] 

accidents_gm_speed_df['Weekend'] = accidents_gm_speed_df['Day'].apply(lambda x: 'Yes' if x in ['Saturday', 'Sunday'] else 'No')
for c in bin_cols:
    accidents_gm_speed_df[c] = accidents_gm_speed_df[c].apply(lambda x: 'Yes' if x ==1 else 'No')
    
#Calculate the difference of the maximum speed 30 mins before the accident and at the time of the accident.
accidents_gm_speed_df['speed_max_diff']= accidents_gm_speed_df['speed_max'].sub(accidents_gm_speed_df['speed_30bef_max'],
                                                                                axis=0)
#If speed_max_diff > 0 then the speed at the time of the accident is higher than before
#If speed_max_diff < 0 then the speed is lower at the time of the accident

Data Exploration and Visualisations of Google Maps traffic data in combination with the accidents data

The traffic data extracted from Google Maps will be explored to answer the following questions:

  • Did the accident affect the speed at which a car could cross the accident area, and hence the traffic in the affected area?
  • Does the speed at which cars were moving 30 minutes before the accident have any correlation with the severity of the accident? Our intuition is that a higher speed results in higher severity, which is why we choose to analyse the maximum speed rather than the average speed.
  • Do road conditions, such as the existence of a crossing, or the day of the week (weekend or not), affect positively or negatively the difference in the speed at which a car could move after the accident?

To answer these questions we first plot the distribution of the maximum speed per severity level and the distribution of the speed difference per severity level. Afterwards, we plot violin plots of the distribution of the speed difference by severity level, splitting the data based on a different column each time, in order to investigate the impact that road conditions or the day of the week have. The attributes that we investigate relative to the speed difference are: Crossing, Junction, Amenity, Stop, Traffic_Signal and Weekend.

In [152]:
#Plot the distribution of the speed per severity and the speed difference per severity

fig, ax = plt.subplots(1, 2, figsize=(14,6))

accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==2]['speed_max'].plot(legend=True, kind = 'density', ax = ax[0])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==3][ 'speed_max'].plot(legend=True, kind = 'density', ax = ax[0])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==4][ 'speed_max'].plot(legend=True, kind = 'density', ax = ax[0])

ax[0].set_title('Speed distribution', fontsize=15, y =1.1)
ax[0].set_ylabel('Density', fontsize=15)
ax[0].set_xlabel('Speed (meters/second)', fontsize=15)
ax[0].xaxis.set_label_coords(1.1, -0.1)
ax[0].yaxis.set_label_coords(-0.2,0.5)
ax[0].legend(labels =['Severity 2','Severity 3','Severity 4'], loc='upper left', fontsize=11)

accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==2]['speed_max_diff'].plot(legend=True, kind = 'density', ax = ax[1])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==3]['speed_max_diff'].plot(legend=True, kind = 'density', ax = ax[1])
accidents_gm_speed_df[accidents_gm_speed_df['Severity'] ==4]['speed_max_diff'].plot(legend=True, kind = 'density', ax = ax[1])
ax[1].set_title('Speed Difference distribution', fontsize=15, y =1.1)
ax[1].set(ylabel=None)
ax[1].legend(labels =['Severity 2','Severity 3','Severity 4'], loc='upper left', fontsize=11)

plt.show()

Observations:

From the speed density plot above we see that the maximum speed at which a car could cross the area of the accident at the time of the accident, based on Google Maps data, follows an approximately normal distribution, with very similar behaviour for accidents of severities 2 & 3, where the median speed is around 30 meters per second. Accidents with severity 4 differ, as their median speed is a bit lower, around 20 meters per second, opposite to what we were expecting. We also observe that severity 4 accidents have a higher spread in their speed values. This might mean that low-severity accidents usually happen on roads with a common speed limit.

From the speed difference density plot above we see that all severity levels behave very similarly with respect to the speed difference before and during the accident: the median value is close to 0, meaning that usually there is no big speed difference. However, the tail of the severity 2 distribution is long but thin, implying the presence of some outliers. To facilitate the next visualisations, we remove the accidents whose speed difference is more than 3 standard deviations away from the mean. In that way we can focus our analysis on the majority of the accidents.

In [153]:
'''Standardize speed differences and remove the accidents with a speed difference 
more than 3 standard deviations away from the mean.'''

data = accidents_gm_speed_df['speed_max_diff']
data_z = (data-data.mean())/(data.std())
gm_speed_df_3stds = accidents_gm_speed_df[(np.absolute(data_z) < 3) ] 
In [154]:
bin_cols.append("Weekend")

sns.reset_orig()
f, axes = plt.subplots(2,3, figsize=(15,15), sharey=True, sharex=True)

for i, c in enumerate(bin_cols): #iterate through the columns based on which we want to split the data
    #set the correct values for the axes indices:
    if i <3 :
        row = 0
        j = i
    else :
        j =i-3
        row =1
        
    sns.violinplot(x="Severity", y="speed_max_diff", data=gm_speed_df_3stds, hue = c,
                   palette="RdBu", split=True, ax=axes[row,j]).set_title('Accidents split by ' + c)
    axes[row,j].legend(bbox_to_anchor=(0.9, 1), loc=2, borderaxespad=0., title=c, fontsize =12, title_fontsize=12).get_frame().set_alpha(None);
    axes[row,j].set(xlabel=None)
    axes[row,j].set(ylabel=None)
    
plt.suptitle('Maximum Speed Difference vs Severity', fontsize = 22);
f.subplots_adjust(wspace = 0.5); #adjust the plots so that the first column has some space from the second 
axes[0,0].set_xlabel("Severity", size=20)
axes[0,0].set_ylabel("Difference in the maximum Speed", size=20)
axes[0,0].xaxis.set_label_coords(2, -1.3)
axes[0,0].yaxis.set_label_coords( -0.2, -0.1)

Observations:

From the accidents split by crossing violin plot we understand that the vast majority of accidents that occurred at a crossing resulted in the highest severity level. The speed difference for these accidents also follows a very different distribution, with very heavy tails: when a severity 4 accident occurred at a crossing, the traffic was affected with high variability, i.e. the speed could either increase or decrease considerably.

A high variability in the speed difference (although not as high) is observed again for severity 4 accidents when a traffic signal was at the accident location. However, we see that the traffic signal influences the traffic positively: when it is present, the speed at which cars can move usually increases. This is derived from the right-skewed distribution where Severity = 4 and Traffic_Signal = Yes (blue). The intuition behind this could be that traffic signals facilitate the traffic, so there is less need for human intervention to control it, compared to roads without a traffic signal.

High variability in the speed difference for severity 4 accidents is also seen on weekends. This could be explained by the fact that on weekends police staffing is lower and more people go on trips or to places they have not been before. This, of course, can make the speed vary more.

It is also interesting to observe from the accidents split by amenity violin plot that almost none of the most severe accidents happened close to an amenity. Also, the severity 2 accidents that occurred next to an amenity clearly affected the traffic negatively, by reducing the speed at which a car could move. This can be intuitively explained by the fact that a road with amenities is usually a small road in a residential area, so an accident might block a big part of it.

3.6 Accident Description data

When accidents in the US are recorded, a Description is also recorded. Obviously the Description is created after the accident, so it will not be used for the purpose of the Severity prediction. However, we would like to explore the features of the Description in relation to the Severity, to understand whether a different format is followed at each severity level and whether information about the features of the accident can be extracted that helps us understand what the different severity levels represent.

For a start, we print one random example of the Description column from each Severity level, in order to have an overall understanding of the text.

In [155]:
description_df = accidents_clean_df[['Description', 'Severity']].copy() #copy so that new columns can be added without a SettingWithCopyWarning

#drop missing values if any on the description column
#description_df= description_df.dropna()

#Get the value of one Description from each severity level to display as an example
desc2_str = description_df.loc[description_df['Severity'] == 2,'Description'].iloc[100] #100 is just an arbitrary position
desc3_str = description_df.loc[description_df['Severity'] == 3,'Description'].iloc[100]
desc4_str = description_df.loc[description_df['Severity'] == 4,'Description'].iloc[100]

Example of the Description text for each Severity level

In [207]:
print("\n")
print(100 * '^')
print('Severity 2: '+ Fore.GREEN  + desc2_str)
print(Fore.BLACK + 'Severity 3: '+ Fore.BLUE + desc3_str)
print(Fore.BLACK + 'Severity 4: '+ Fore.RED + desc4_str)
print(Fore.BLACK +100 * '^')
print("\n")

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Severity 2: At I-475/Exit 204 - Accident.
Severity 3: Between Mitchell Ave/Exit 6 and OH-562/Exit 7 - Accident.
Severity 4: Closed between I-70 and US-40/National Rd - Road closed due to accident.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


The first observation is that Severity 2 accidents are described by the location of the accident, Severity 3 by a wider area, and Severity 4 by the consequences of the accident on the traffic. However, this observation might be coincidental, so we perform a more detailed analysis below.

We calculate and plot the average number of words that are used for the description of an accident for each severity level.

In [156]:
#Get the number of words
description_df['desc_word_count'] = description_df['Description'].apply(lambda x: len(str(x).split()))


#Calculate and plot the average number of words that appear in the Description column, per Severity level
descr_agg_df = description_df.groupby('Severity')['desc_word_count'].agg('mean')

fig, ax = plt.subplots( figsize=(15, 4))
descr_agg_df.sort_values(ascending=True).head(3).plot(kind='bar', color=(['g','b','r']), alpha=0.6)

ax.set_ylabel('Number of words', fontsize =14)
ax.set_xlabel('Severity', fontsize =14)
ax.set_xticklabels(['2', '3', '4'], rotation=0, fontsize='medium');

plt.title('Average number of words in the Description of each accident, for each severity level',fontsize=16, y =1.1);

From the bar plot above it is clear that accidents with severities 2 and 3 are described with a similar number of words (around 7), whereas accidents with severity 4 are described with roughly twice as many words.

For the next step of the text analysis we will consider as our corpus the Description values for each severity level separately, split the text into unigrams (words) and count the number of accident descriptions in which each word appears (not the total number of times the word appears). We will not remove stopwords such as "and", because, as seen from the Description examples above, the text has a semi-structured format where words like "and" or "at" can reveal the type of accident. For example, the word "and" shows that more than one road was involved in the accident.

We will then transform the word count into a word frequency by dividing the number of descriptions in which a word occurred by the total number of accidents (for a severity level). We will use the word frequency to plot word clouds and a bar plot of the frequency of the top 16 words.

It was seen that the word "accident" was used in every description, and hence it was eliminated from the analysis as it would not provide any insight.
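The counting scheme described above (number of descriptions containing a word, rather than total occurrences) corresponds to `CountVectorizer` with `binary=True`. A toy illustration with a made-up two-description corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: "accident" appears 3 times in total, but in only 2 descriptions
docs = ["accident at exit", "accident accident at road"]

vec = CountVectorizer(binary=True).fit(docs)
counts = vec.transform(docs).sum(axis=0)  # per-word number of descriptions, not occurrences
word_dic = {word: int(counts[0, idx]) for word, idx in vec.vocabulary_.items()}
print(word_dic['accident'])  # 2, because binary=True caps each description's contribution at 1
```

Without `binary=True` the same word would count 3, which is the distinction the analysis below relies on.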

In [157]:
'''Create a function that splits the text into unigrams (words), counts the number of accident descriptions 
in which each word appears (not the total number of times the word appears) and returns a dictionary with the results.'''

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(binary=True).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_count = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_count =sorted(words_count, key = lambda x: x[1], reverse=True)[:n]

    word_dic = {}
    for word, count in words_count:
        word_dic[word] = count

    word_dic.pop("accident", None) # Remove "accident", which is present in all records
      
    return word_dic
In [158]:
word_dic_sev2 = get_top_n_words(description_df.loc[description_df['Severity'] == 2,'Description'], 16)
word_dic_sev3 = get_top_n_words(description_df.loc[description_df['Severity'] == 3,'Description'], 16)                                                   
word_dic_sev4 = get_top_n_words(description_df.loc[description_df['Severity'] == 4,'Description'], 16)

# Generate word cloud images for different severities
fig = plt.figure(figsize = (16,20))

for c in (2,3,4): #for the different values of severity
    wordcloud= WordCloud(background_color="AliceBlue",
                                        contour_width=5, contour_color='steelblue'
                                       ).generate_from_frequencies(globals()[f'word_dic_sev{c}'])

    plt.subplot(6,3, (c-1)).set_title("Severity " + str(c), fontsize= 15)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    
fig.suptitle('Word Cloud images per Severity', fontsize= 22, y = 0.95)
plt.show()

#Print multiple empty lines to make multiple graph display clean
print(2 * "\n") 
print(115 *'^')
print(2 * "\n") 


#Plot a bar chart with the most common words for each severity level
def get_word_freq(sev):
    #Transform dictionary into a dataframe
    data_items = globals()[f'word_dic_sev{sev}'].items()
    data_list = list(data_items)
    word_df = pd.DataFrame(data_list)
    word_df.columns =['word', 'count'] #name the columns of the dataframe
    word_df = word_df.set_index(['word']) #set index to word    

    #Count the number of accidents for each severity level
    n_accidents_sev = description_df.loc[description_df['Severity'] == sev].shape[0]
    
    #calculate the frequency of each word based on the count and the number of accidents each severity level has,
    #so that the number next to each word is comparable across different severity levels
    word_df['freq']= word_df['count'].apply(lambda x: np.around((x/ n_accidents_sev),decimals = 2))
    word_df= word_df.drop(['count'], axis= 1)
    
    return word_df


for sev in (2,3,4):
    globals()[f'word_sev{sev}_df']= get_word_freq(sev)
    
word_freq_all_df = pd.concat([word_sev2_df, word_sev3_df, word_sev4_df], axis=1, join ='outer', ignore_index =True)
word_freq_all_df.columns =['Severity 2', 'Severity 3', 'Severity 4']

for sev in (2,3,4):    
    word_freq_all_df['Severity ' + str(sev)] = word_freq_all_df['Severity ' + str(sev)].fillna(0)

word_freq_all_df = word_freq_all_df.sort_values(by=['Severity 2', 'Severity 3', 'Severity 4'], ascending=False) #assign the result, otherwise the sort is discarded

#bar chart
ax = word_freq_all_df.plot.bar(edgecolor='none', alpha = 0.5, figsize=(14,4), fontsize=14, color =('g','b', 'orange'));
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.,fontsize=12)
ax.set_xlabel('Words', fontsize =17);
ax.set_ylabel('Frequencies' , fontsize =17)
ax.set_title('Most common words on the Description', fontsize = 24, pad =35, y =1.05)
plt.suptitle('(sorted by severity 2,3,4)', y=1.01, fontsize=14);


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^



Observations:

From the Word Cloud we see that the words "exit" and "at" have the highest frequency for severities 2 and 3, whereas for accidents with severity 4 the words "closed", "road", "to" and "due" appear more often. This is also confirmed by the bar plot, where severities 2 and 3 follow a similar pattern, opposite to severity 4.

From these visualisations we can conclude that accidents are labelled with severity level 4 when a road needs to close because of the accident. We also see that accidents with a severity of 2 or 3 often happen at a highway exit.

4. Dimensionality Reduction

The categorical variables City, State and Weather_Condition have a high granularity, with 813, 20 and 23 levels respectively, as seen in the initial Data Exploration part. Some algorithms used for predictive modelling can only handle numeric variables, and transforming these categorical variables into dummies (numeric) can add a lot of complexity to the models: the City variable alone would produce 813 new dummy variables. To address this issue, and at the same time include the City, State and Weather_Condition information in the model, we will convert these categorical data into dummy variables and then try to reduce their high dimensionality to 1, 2 or 3 numeric components. We are aware that this process will reduce the interpretability of the models that will be created, but accuracy may be improved.

Therefore, in this part of the project, we perform the following actions:

  • Transform the 3 categorical variables into dummy variables
  • Apply the dimensionality reduction techniques of Principal Component Analysis (PCA), Multi-Dimensional Scaling (MDS), Isomap, t-distributed Stochastic Neighbor Embedding (t-SNE), Spectral Embedding (SpEmb) and Locally Linear Embedding (LLE), to get 1, 2 and 3 components that represent the combination of the City and State variables, and 1, 2 and 3 components that represent the Weather_Condition variable.
  • Calculate the correlation of the generated embeddings with the Severity (as Severity will be the target of the predictive models). We pick the combination of components with the highest correlation to try in the predictive modelling, especially for the models that assume a linear relationship, as the correlation captures linear similarity.
  • Plot scatter plots of the 2-component embeddings, coloured by severity, in order to see whether we can identify a relationship with the severity and hence try the corresponding embeddings as variables in the predictive modelling.
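The first two steps above can be sketched on a toy example (with made-up city labels; PCA is obtained via TruncatedSVD on centred dummies, as in the code further below):

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Hypothetical mini-example: 6 accidents across 4 cities -> 4 dummy columns
toy_df = pd.DataFrame({'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'Austin']})
dummies = pd.get_dummies(toy_df, columns=['City'])
X = dummies.to_numpy(dtype=float)
print(X.shape)  # (6, 4): one dummy column per distinct city

# TruncatedSVD applied to centred data is equivalent to PCA
svd = TruncatedSVD(n_components=2)
emb = svd.fit_transform(X - X.mean(axis=0))
print(emb.shape)  # (6, 2): the 4 dummy columns compressed into 2 components
```

In the real data this is what turns the 813 City dummies into a handful of numeric components per accident.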
In [159]:
#Create objects that will be needed later

Y_true = accidents_clean_df['Severity'].to_numpy() #save the label of severity in an array

models =("PCA", "MDS", "Isomap", "SpEmb", "LLE", "tSNE")

n_rows = len(Y_true) #number of rows

#list of colors we want to use in the scatter plots
cmap = LinearSegmentedColormap.from_list('severities', ['turquoise', 'darkviolet','red'])

sev = ['2','3','4'] #unique severity levels
In [160]:
#Create a function that runs the dimensionality reduction techniques for the given dataset and number of components.

def dimensionality_red_methods(X, comp): #input: dataset whose dimensionality we reduce, and number of final components
    #PCA
    clf = decomposition.TruncatedSVD(n_components=comp) #TruncatedSVD on centered data is equivalent to PCA
    X_PCA = clf.fit_transform(X-X.mean(axis=0)) #We fit the model after centering the data to 0
                                           
    #MDS
    clf = manifold.MDS(n_components=comp, n_init=1, max_iter=500)
    X_MDS = clf.fit_transform(X)

    #Isomap
    n_neighbors = 30
    clf = manifold.Isomap(n_neighbors=n_neighbors, n_components=comp)
    X_Isomap = clf.fit_transform(X)

    #Spectral Embedding
    n_neighbors = 30
    clf = manifold.SpectralEmbedding(n_components=comp, random_state=0, eigen_solver="arpack", affinity='nearest_neighbors', n_neighbors=n_neighbors)
    X_SpEmb = clf.fit_transform(X)

    #LLE
    clf = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=comp, method='standard')
    X_LLE = clf.fit_transform(X)

    #t-SNE
    clf = manifold.TSNE(n_components=comp, init='pca', random_state=0, n_iter=500)
    X_tSNE = clf.fit_transform(X)
    
    return X_PCA, X_MDS, X_Isomap, X_SpEmb, X_LLE, X_tSNE

4.1 For City, State variables

Transform City and State variables into binary dummy variables.

In [161]:
location_dummies_df = pd.get_dummies(accidents_clean_df[['City',  'State']], columns=['City',  'State'], prefix_sep='_')

X_city_dummies = location_dummies_df
X_city_dummies = X_city_dummies.to_numpy()

Run the 6 dimensionality reduction methods for the combination of the City and State variables, for 1, 2 and 3 components.

In [162]:
drm_city_1 = dimensionality_red_methods(X_city_dummies, 1)
drm_city_2 = dimensionality_red_methods(X_city_dummies, 2)
drm_city_3 = dimensionality_red_methods(X_city_dummies, 3)

'''Create a matrix for each group of embeddings that is produced. 
For example, the matrix X1_city_PCA will store the 1 component produced from the PCA method for the City/State variables.
The matrix X2_city_PCA will store the 2 components produced from the PCA method for the City/State variables.'''

topic = 'city'
for comp in range(1,4):

    for i, m in enumerate(models):
        X = globals()[f'drm_{topic}_{comp}'][i]
        
        # normalize the data to be in the range [0,1]
        x_min, x_max = np.min(X, 0), np.max(X, 0)
        globals()[f'X{comp}_{topic}_{m}'] = (X - x_min) / (x_max - x_min)  

Calculate the correlation of the new City/State-related embeddings with Severity. We will use these correlations to decide which embeddings are most useful for the severity prediction models.
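For a single component, the Pearson correlation computed in the next cell can equivalently be obtained with `np.corrcoef` (illustrated here with made-up embedding values and severity labels):

```python
import numpy as np

# Hypothetical 1-D embedding (normalised to [0, 1]) and severity labels
emb = np.array([0.1, 0.4, 0.35, 0.9, 0.2])
sev = np.array([2, 3, 3, 4, 2])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the component and severity
r = np.corrcoef(emb, sev)[0, 1]
print(np.around(r, 2))  # 0.96 for this toy data
```

The function below does the same thing via a concatenated DataFrame so that it scales to 2 and 3 components at once.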

In [163]:
def correlation_of_embeddings_with_severity (topic):
    for comp in range(1,4):
        df  = pd.DataFrame() #initialize the dataframe to store the results 

        for i, m in enumerate(models):

            if comp == 1:
                X_Y = np.concatenate([globals()[f'X{comp}_{topic}_{m}'].reshape(n_rows,1), 
                                       Y_true.reshape(n_rows,1)], 
                                      axis =1)
            if comp ==2:
                X_Y = np.concatenate([globals()[f'X{comp}_{topic}_{m}'][ :, 0].reshape(n_rows,1), #globals()[f'X{comp}_{m}'] is for X2_PCA
                           globals()[f'X{comp}_{topic}_{m}'][ : ,1].reshape(n_rows,1) , 
                           Y_true.reshape(n_rows,1)], 
                          axis =1)

            if comp ==3:            
                X_Y = np.concatenate([globals()[f'X{comp}_{topic}_{m}'][ :, 0].reshape(n_rows,1), #globals()[f'X{comp}_{m}'] is for X2_PCA
                               globals()[f'X{comp}_{topic}_{m}'][ : ,1].reshape(n_rows,1) , 
                               globals()[f'X{comp}_{topic}_{m}'][ : ,2].reshape(n_rows,1) ,
                               Y_true.reshape(n_rows,1)], 
                              axis =1)

            X_Y_df = pd.DataFrame(X_Y)
            corr = np.around(X_Y_df.corr(method ='pearson'),decimals= 2)   

            df[m] = corr.iloc[0:comp,comp]

        display(df)
        
        
correlation_of_embeddings_with_severity('city')  
1 component:
     PCA    MDS  Isomap  SpEmb    LLE  tSNE
0   0.19  -0.04    0.24   0.20  -0.22  0.22

2 components:
     PCA    MDS  Isomap  SpEmb    LLE  tSNE
0   0.19   0.06    0.24   0.20  -0.20  0.21
1   0.13  -0.15    0.04  -0.12   0.22  0.09

3 components:
     PCA    MDS  Isomap  SpEmb    LLE  tSNE
0   0.19   0.08    0.24   0.20  -0.13  0.22
1   0.13   0.01    0.04  -0.12  -0.20  0.12
2  -0.07   0.19   -0.14   0.13   0.22 -0.05

Observations:

It can be seen that the 1-component embedding from Isomap has the highest correlation (0.24) with severity, so we will try that variable for the predictive model (X1_city_Isomap).

From the 2-component embeddings it is harder to decide based on correlation, as the city information is shared across 2 variables. Isomap has the highest correlation in the first component but a very low correlation in the second. LLE, on the other hand, has relatively high correlations (in absolute value) in both. To facilitate the decision regarding the 2 components, scatter plots are shown below.

From the 3-component embeddings LLE and t-SNE have interesting correlations.

Below we plot embeddings of 2 components for City/State:

In [164]:
def scatter_model(row, col, model, model_emb):
    sc = ax[row,col].scatter(model_emb[:,0], model_emb[:,1], s=17,
                c = accidents_clean_df['Severity'].astype('category').cat.codes,  cmap=cmap)
    ax[row,col].set_title(model,  fontsize = 24)
    return sc

fig, ax = plt.subplots(2, 3, figsize=(17,14), sharey=True, sharex=True)

scatter = scatter_model(0,0,'t-SNE', X2_city_tSNE)     
scatter_model(0,1,'PCA', X2_city_PCA)
scatter_model(1,0,'Isomap', X2_city_Isomap)
scatter_model(1,1,'MDS', X2_city_MDS)
scatter_model(0,2,'LLE', X2_city_LLE)
scatter_model(1,2,'SpEmb', X2_city_SpEmb)

ax[0,0].set_xlabel("Embedding 1", size=20)
ax[0,0].set_ylabel("Embedding 2", size=20)
ax[0,0].xaxis.set_label_coords(1.7, -1.3)
ax[0,0].yaxis.set_label_coords( -0.1, -0.14)

# add legend to the plot with names
ax[0,0].legend(handles=scatter.legend_elements()[0], 
           labels=sev,
           title="Severities", bbox_to_anchor=(-0.35, 1.3), loc=2, borderaxespad=0., fontsize =16, title_fontsize=16);

fig.suptitle('City/State Embeddings', fontsize = 24, y =1.0); #Main title

The t-SNE plot is very interesting, as a cluster of the severity 4 accidents (red) can be identified. The correlation calculated before captures the linear relationship of each component with severity, but this plot reveals a relationship with severity that is not clearly linear. So these 2 components derived from t-SNE will be passed to the modelling part, to test whether they perform well in predicting severity.

The rest of the plots are not very helpful in identifying a relationship with severity, so the components derived from the rest of the methods will not be used.

4.2 For Weather Condition variable

Transform Weather_Condition variable into binary dummy variables.

In [165]:
weather_dummies_df = pd.get_dummies(accidents_clean_df['Weather_Condition'], columns=['Weather_Condition'], prefix_sep='_')

X_weather_dummies = weather_dummies_df
X_weather_dummies = X_weather_dummies.to_numpy()

Run the 6 dimensionality reduction methods for the Weather_Condition variable, for 1, 2 and 3 components.

In [166]:
drm_weather_1 = dimensionality_red_methods(X_weather_dummies, 1)
drm_weather_2 = dimensionality_red_methods(X_weather_dummies, 2)
drm_weather_3 = dimensionality_red_methods(X_weather_dummies, 3)

topic = 'weather'
for comp in range(1,4):

    for i, m in enumerate(models):
        X = globals()[f'drm_{topic}_{comp}'][i]
        
        # normalize the data to be in the range [0,1]
        x_min, x_max = np.min(X, 0), np.max(X, 0)
        globals()[f'X{comp}_{topic}_{m}'] = (X - x_min) / (x_max - x_min) 
        
#The above stores the results of the function into variables named like X1_weather_PCA, X2_weather_PCA,
#depending on the number of components and the method followed

Calculate the correlation of the new Weather_Condition related embeddings with Severity.

In [167]:
correlation_of_embeddings_with_severity('weather')
1 component:
     PCA    MDS  Isomap  SpEmb    LLE  tSNE
0  -0.08   0.03   -0.13  -0.05   0.06 -0.10

2 components:
     PCA    MDS  Isomap  SpEmb    LLE  tSNE
0  -0.08   0.11   -0.13  -0.05  -0.02 -0.05
1  -0.11  -0.01    0.04  -0.06  -0.06 -0.12

3 components:
     PCA    MDS  Isomap  SpEmb    LLE  tSNE
0  -0.08  -0.04   -0.13   0.05  -0.02 -0.10
1  -0.11   0.04    0.04  -0.06   0.06 -0.12
2  -0.04   0.09    0.02  -0.08  -0.03 -0.04

Observations:

Embeddings that represent weather condition information generally have lower correlations than the city/state embeddings.

It can be seen that the 1-component embedding from Isomap has the highest correlation (-0.13) with severity, so we will try that variable for the predictive model (X1_weather_Isomap).

From the 2-component embeddings it is harder to decide based on correlation. Isomap and PCA probably have the best correlations overall. To facilitate the decision regarding the 2 components, scatter plots are shown below.

From the 3-component embeddings we see that the correlations continue to be very low, even lower than before, so we decide not to use any of these embeddings.


Below we plot the 2-component embeddings for Weather_Condition:

In [168]:
def scatter_model(row, col, model, model_emb):
    sc = ax[row,col].scatter(model_emb[:,0], model_emb[:,1], s=17,
                c = accidents_clean_df['Severity'].astype('category').cat.codes,  cmap=cmap)
    ax[row,col].set_title(model,  fontsize = 24)
    return sc

fig, ax = plt.subplots(2, 3, figsize=(17,14), sharey=True, sharex=True)

scatter = scatter_model(0,0,'t-SNE', X2_weather_tSNE)     
scatter_model(0,1,'PCA', X2_weather_PCA)
scatter_model(1,0,'Isomap', X2_weather_Isomap)
scatter_model(1,1,'MDS', X2_weather_MDS)
scatter_model(0,2,'LLE', X2_weather_LLE)
scatter_model(1,2,'SpEmb', X2_weather_SpEmb)


ax[0,0].set_xlabel("Embedding 1", size=20)
ax[0,0].set_ylabel("Embedding 2", size=20)
ax[0,0].xaxis.set_label_coords(1.7, -1.3)
ax[0,0].yaxis.set_label_coords( -0.1, -0.14)

# add legend to the plot with names
ax[0,0].legend(handles=scatter.legend_elements()[0], 
           labels=sev,
           title="Severities", bbox_to_anchor=(-0.35, 1.3), loc=2, borderaxespad=0., fontsize =16, title_fontsize=16);

fig.suptitle('Weather Embeddings', fontsize = 24, y =1.0);#Main title

No clear correlation with severity can be seen in any of the above scatter plots.

Based on our analysis before, we select a few of the created embeddings to feed the modelling part:

In [169]:
#Save the embeddings that we want to try in predictive modelling, into a dataframe
IDs = accidents_clean_df.index

embeddings_df  = pd.DataFrame(index = IDs)


# Map each column name to the embedding values it should hold
embedding_columns = {
    'city_emb1_Isomap_1':    X1_city_Isomap,
    'city_emb2_LLE_1':       X2_city_LLE[:, 0],
    'city_emb2_LLE_2':       X2_city_LLE[:, 1],
    'city_emb2_tSNE_1':      X2_city_tSNE[:, 0],
    'city_emb2_tSNE_2':      X2_city_tSNE[:, 1],
    'city_emb3_LLE_1':       X3_city_LLE[:, 0],
    'city_emb3_LLE_2':       X3_city_LLE[:, 1],
    'city_emb3_LLE_3':       X3_city_LLE[:, 2],
    'weather_emb1_Isomap_1': X1_weather_Isomap,
    'weather_emb2_PCA_1':    X2_weather_PCA[:, 0],
    'weather_emb2_PCA_2':    X2_weather_PCA[:, 1],
}

for name, values in embedding_columns.items():
    embeddings_df[name] = values

embeddings_df.head()
Out[169]:
city_emb1_Isomap_1 city_emb2_LLE_1 city_emb2_LLE_2 city_emb2_tSNE_1 city_emb2_tSNE_2 city_emb3_LLE_1 city_emb3_LLE_2 city_emb3_LLE_3 weather_emb1_Isomap_1 weather_emb2_PCA_1 weather_emb2_PCA_2
ID
A-2716600 0.450302 0.017496 0.121970 0.538984 0.085713 0.646823 0.017839 0.121959 3.798097e-01 0.293304 6.133364e-01
A-2716601 0.539689 0.017799 0.121433 0.770524 0.182628 0.648222 0.017153 0.121438 3.798097e-01 0.293304 6.133364e-01
A-2716602 0.546339 0.018269 0.121777 0.582919 0.296352 0.646180 0.018893 0.121795 4.092663e-16 0.000000 0.000000e+00
A-2716603 0.546339 0.018262 0.121787 0.582919 0.296352 0.646181 0.018844 0.121783 4.092663e-16 0.000000 1.666635e-16
A-2716604 0.584862 0.016492 0.121781 1.000000 0.261156 0.645893 0.017822 0.121782 3.183182e-16 0.000000 1.666635e-16

5. Model Evaluation

Having explored the data, we would like to see if we can predict the severity of the accidents given the features. We would also like to see which variables are most influential.

Clean data

In [170]:
df['day (int)'] = df['Start_Time'].dt.weekday
df['day (int)'].nunique()

# Change Bool columns to 0 and 1
columns = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
     'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']

for c in columns:
    df[c] = df[c].astype(int)
df.reset_index(drop = True, inplace = True)

Since the data is highly skewed, we will use precision-recall curves to evaluate the models. With imbalanced data, accuracy yields misleading results: a classifier that simply predicts the majority class achieves a high accuracy that merely reflects the class distribution. This is known as the accuracy paradox [4]. Therefore we will not use accuracy to compare the models.
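As a small illustration of the accuracy paradox (the class counts here are made up for the example, not taken from the accident data), a classifier that always predicts the majority class scores a high accuracy yet has zero recall on the minority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy imbalanced labels: 90 light accidents (0), 10 severe (1)
y_true = np.array([0] * 90 + [1] * 10)

# A "classifier" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # high accuracy despite learning nothing
print(recall_score(y_true, y_pred))    # zero recall on the severe class
```

Precision-recall curves avoid this trap by focusing on how well the positive (minority) class is retrieved.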

To model the data, variables that are not a float or integer will need to be removed, unless turned into dummy variables. In addition, booleans are transformed into integers, 0 and 1.

5.1 Multi-class classification

We first explore the multi-class case. Precision-recall curves are usually used in binary classification, so in order to use them for multi-class classification the output must be binarized. This is done using the python function label_binarize().
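Concretely, label_binarize (from sklearn.preprocessing) turns each severity label into a one-hot row, one binary column per class, so that a per-class precision-recall curve can be drawn against each column:

```python
from sklearn.preprocessing import label_binarize

# Each severity value (2, 3 or 4) becomes a one-hot row
Y = label_binarize([2, 3, 4, 2], classes=[2, 3, 4])
print(Y)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]
```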

From the precision-recall plot produced below, we can immediately see that Severity 2 has the best score, showing that it would yield the most accurate results. This could be because it has far more data points than the other labels. Interestingly, Severity 4 has a much higher score than 3.

In [171]:
y = df['Severity'] 
n_class = [2,3,4]
Y = label_binarize(y, classes=[2,3,4])
X = df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
                      'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
                       'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition', 
                        'State abbreviation'])
X_train, X_test, y_train, y_test = train_test_split(X,Y,
                                                    random_state = 33)

clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=50,
                             max_depth=3,
                             random_state=0))
clf.fit(X_train, y_train)

y_score = clf.predict_proba(X_test)     
In [172]:
# precision recall curve
fig, ax = plt.subplots(figsize = (15,6))
precision = dict()
recall = dict()
for i, l in enumerate(n_class):    
    precision[i], recall[i], _ = precision_recall_curve(y_test[:, i],
                                                            y_score[:, i])
    avg_precision = average_precision_score(y_test[:,i], y_score[:, i])
    plt.plot(recall[i], precision[i], lw=2, label='class {} (area={})'.format(l,round(avg_precision,2)))

    
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(loc="best")
plt.title("Precision-recall curve")
plt.show()

5.2 Binary classification

Next, we evaluate a binary case. We combine Severity 3 and 4 and relabel them as 1, indicating a very severe accident. Severity 2 is relabelled as 0, indicating a light accident. Combining the labels has reduced the skewness; however, there are still almost twice as many records in category 0 as in category 1, so it still makes sense to use precision-recall curves to evaluate the models.

In [173]:
df['Severity'].replace({4:1, 3:1, 2:0}, inplace = True)
df['Severity'].unique() 

# Distribution of updated Severity
sns.countplot(data = df, x='Severity')
Out[173]:
<matplotlib.axes._subplots.AxesSubplot at 0x23cc9186610>

We define a function that plots precision-recall curves for each model. Its input is X, so that we can see how the results change with different feature sets and hence compare how the different dimension-reduction methods perform.

Overall, the model that performed best was Random Forest on the original features, without any of the dimensionally reduced variables. Conveniently, Random Forest also lets us inspect which variables contribute most to the model output.

Below we explore which variable is most important using a game theoretic approach, SHapley Additive exPlanations (SHAP).

In [174]:
def get_iterable(x):
    if isinstance(x, list):
        return x
    else:
        return [x]
    
def plot_pr(X, plot_size, title):
    ''' 
    plot pr curves in a single plot
    X are the features in the dataset
    plot_size takes a positive value
    '''
    # Define label
    y = ml_df['Severity']
    
    np.random.seed(55)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.80, random_state=33)

    classifier_names = ["Logistic Regression",
                        "Nearest Neighbors", 
                        "Linear SVM", 
                        "RBF SVM",
                        "Random Forest",
                        "XGBoost"]

    classifiers = [LogisticRegression(C=1., solver='lbfgs'),
                   KNeighborsClassifier(3),
                   SVC(kernel="linear", C=0.025, probability=True),
                   SVC(gamma=2, C=1, probability=True),
                   RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
                   XGBClassifier()]

    pred_prob = {}
    pred = {}

    # fit individual classifiers
    for name, classifier in zip(classifier_names, classifiers):
        np.random.seed(100)
        classifier.fit(X_train,y_train)
        pred_prob[name] = classifier.predict_proba(X_test)[:,1]
        pred[name] = np.where(pred_prob[name] >= 0.5, 1, 0)

    plt.clf()
    fig = plt.figure(figsize=(plot_size,plot_size))
    ax = fig.add_subplot(1, 1, 1)
    
    names = get_iterable(classifier_names)
    for name in names:
        precision, recall, _ = precision_recall_curve(y_test,pred_prob[name])
        avg_precision = average_precision_score(y_test, pred_prob[name])
        ax.plot(recall, precision, lw=2.5, label=name + ' (area = %0.3f)' % avg_precision);
         
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.01])
    plt.title(title);
    plt.legend(loc='center left', bbox_to_anchor=(1.05, 0.5), fontsize=9)
    plt.show(); 

5.3 Feature importance

SHAP plots rank the variables from the most important to the least. Each dot in the plot represents an observation in the dataset. The SHAP value is on the x-axis and indicates how much that observation changes the log-odds; the variable name is on the y-axis. The gradient colour represents the value of the variable (e.g. a high temperature is shown as red whilst a low temperature is blue). Binary variables are either red or blue, corresponding to values of 1 and 0 respectively.

From the plot, we can immediately see that Timezone Eastern is the top key variable. The Eastern Time Zone is represented as orange in the map below. Ohio is within this timezone, and earlier we saw that it had the second highest count of accidents. Furthermore, the plot shows that low humidity is more likely to result in a severe accident. Civil twilight is a way of indicating whether it is day or night, defined by when the geometric center of the sun is 6 degrees below the horizon [6]. The variable has blue dots at positive SHAP values, indicating that accidents in the daytime are more likely to be very severe.

In [180]:
# Merge severity with embedding spreadsheet
ml_df = df.merge(embeddings_df, on = 'ID')

# Using original features
X1 = df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
                      'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
                       'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition', 
                        'State abbreviation'])

# Test out dimension reduction methods
# Use [['city_emb2_tSNE_1','city_emb2_tSNE_2']] in addition to other variables
X2 = ml_df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
                      'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
                       'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition', 
                        'State abbreviation', 'city_emb1_Isomap_1', 'city_emb2_LLE_1', 'city_emb2_LLE_2',
                'city_emb3_LLE_1','city_emb3_LLE_2', 'city_emb3_LLE_3', 'weather_emb1_Isomap_1','weather_emb2_PCA_1', 
                'weather_emb2_PCA_2'])

# Use 'city_emb1_Isomap_1' and 'weather_emb1_Isomap_1'
X3 = ml_df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
                      'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
                       'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition', 
                        'State abbreviation', 'city_emb2_LLE_1', 'city_emb2_LLE_2',
                'city_emb3_LLE_1','city_emb3_LLE_2', 'city_emb3_LLE_3', 'weather_emb2_PCA_1', 
                'weather_emb2_PCA_2', 'city_emb2_tSNE_1','city_emb2_tSNE_2'])

X4 = ml_df.drop(columns = ['Severity', 'ID', 'Day','Start_Time', 'End_Time','Start_Lat', 'Start_Lng', 'End_Lat',
                      'End_Lng', 'Distance_mi', 'Description', 'Street', 'City', 'County', 'State', 'Zipcode',
                       'Airport_Code', 'Weather_Timestamp', 'Wind_Direction', 'Weather_Condition', 
                        'State abbreviation', 'city_emb2_LLE_1', 'city_emb2_LLE_2',
                'city_emb3_LLE_1','city_emb3_LLE_2', 'city_emb3_LLE_3', 'city_emb2_tSNE_1','city_emb2_tSNE_2',
               'city_emb1_Isomap_1', 'weather_emb1_Isomap_1'])

X = [X1, X2, X3, X4]
titles = ['No dimension reduction', 'tSNE on city', 'Isomap on city and weather', 'PCA on weather']
zipped = zip(titles, X)
dic = dict(zipped)

for key, value in dic.items():    
    plot_pr(value, 8, title=key);
   
In [181]:
fig, ax = plt.subplots(figsize = (15,6))
y = ml_df['Severity']

X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=.80, random_state=33)

rfc = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1).fit(X_train, y_train)

print("1. confusion matrix:\n")
y_pred = rfc.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

shap_values = shap.TreeExplainer(rfc).shap_values(X_train)
shap.summary_plot(shap_values[1], X_train)
1. confusion matrix:

[[1532   26]
 [ 673   63]]
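The per-class precision and recall implied by this confusion matrix can be recovered directly from its entries (rows are the true class, columns the predicted class, with the counts printed above):

```python
import numpy as np

cm = np.array([[1532, 26],
               [ 673, 63]])  # rows: true class, columns: predicted class

tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
precision = tp / (tp + fp)   # 63 / 89
recall    = tp / (tp + fn)   # 63 / 736
print(round(precision, 2), round(recall, 2))
```

So while the model's positive predictions are fairly precise, it misses the large majority of very severe accidents, which is consistent with the class imbalance discussed earlier.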
In [182]:
from IPython.display import Image
Image(filename='us-easterntz.jpg', width =500, height=300)
Out[182]:

6. Main Findings

Overall, road traffic was not clearly affected by the accidents, nor did we find speed to be a cause of accidents. However, road conditions, such as the presence of nearby amenities, had a bigger impact on the traffic consequences of an accident. Municipalities can take this into consideration and put strategies in place for dealing with accidents that occur next to amenities, so that traffic is not severely affected.

US road accidents are highly correlated with state population: more densely populated states generally have more accidents than sparsely populated ones. Moreover, road accidents often occurred on clear or overcast days.

We have seen that most accidents occur at junctions and in the states of California and Ohio. Furthermore, most accidents occur during rush hour. During the weekend, most of the accidents that occur are very severe.

Random Forest was the best model for predicting severity, with an area of 0.603 under the precision-recall curve. Using SHAP, we saw that the most influential feature was the Eastern Timezone. In the future, to get better results, we would balance the data using a method such as the Synthetic Minority Over-sampling Technique (SMOTE).
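SMOTE (available in the imbalanced-learn package as `SMOTE().fit_resample(X, y)`) synthesises new minority-class points by interpolating between neighbours. As a simpler baseline, random oversampling of the minority class can be sketched with scikit-learn alone (the data below is made up for illustration):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Toy imbalanced data: 20 majority (0) vs 5 minority (1) samples
X = rng.normal(size=(25, 3))
y = np.array([0] * 20 + [1] * 5)

# Resample the minority class with replacement up to the majority size
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=20, random_state=33)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # both classes now have 20 samples
```

Unlike this duplication-based approach, SMOTE generates genuinely new synthetic points, which tends to reduce overfitting to repeated minority samples.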

7. References

[1] https://www.kaggle.com/sobhanmoosavi/us-accidents


[2] Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.

[3] Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

[4] Brownlee, J. (2021). Failure of Classification Accuracy for Imbalanced Class Distributions.

[5] Sandy (2019). California Car Accidents: Statistics and Causes

[6] https://www.weather.gov/lmk/twilight-types#:~:text=Civil%20Twilight%3A,horizon%2C%20and%20ends%20at%20sunrise.

[7] Seber, G and Lee, A. (2012). Linear Regression Analysis. New Jersey, United States: John Wiley & Sons.

8. Individual Contribution to the project

Alexia Sarantidi

My contribution was the following:

Section Task
Introduction Added picture, half of the introduction text, the Table of Contents
Part 2 ALL (except Web-Scraping for State names)
Part 3.5 ALL
Part 3.6 ALL
Part 4 ALL
Part 6 half
Part 7 half

As this was a group project, time was also spent in coordinating with my team-members and combining our work.

Team member 2

The contribution was the following:

Section Task
Introduction half of the introduction text
Part 2.2 Web-Scraping for State names
Part 3.2 ALL
Part 3.3 ALL
Part 3.4 ALL
Part 5 ALL
Part 6 half
Part 7 half

As this was a group project, time was also spent in coordinating with my team-members and combining our work.

Team Member 3

The contribution was the following:

Section Task
Part 2.2 Helped with web scraping
Part 3.1 (Objective 1-5) + Heatmap ALL